

From Cognitive Modeling to Typological Universals: Investigations with (Large) Language Models

Talk at ETH Zurich on 2024/5/28

Tatsuki Kuribayashi

May 28, 2024

Transcript

  1. From Cognitive Modeling to Typological Universals: Investigations with (Large) Language

    Models Tatsuki Kuribayashi (MBZUAI) @Rycolab, ETH Zürich 1
  2. Hello! • Tatsuki Kuribayashi • Postdoc at MBZUAI, UAE (2023-)

    • Rapidly growing international NLP team • Ph.D. (information science) from Tohoku University, Japan • The third national university established in Japan • Japanese ∩ UAE residents ∩ cogsci-related researcher • Why here today? • We ate durian with Alex, Ethan, and Mario at EMNLP 2023 • I’ve been genuinely interested in the ETH community, which publishes high-quality work in cognitive modeling, my field of interest! Switzerland-inspired park! (closed in 2001) [Map labels: Tokyo, Kyoto, the deepest lake in Japan, Tohoku area] https://daiou-print.sakura.ne.jp/myhp/travel/diary/1999travel/akita/suisu-mura.html 2
  3. • Cognitively-motivated natural language processing (NLP) research • Lower perplexity

    is not always human-like [Kuribayashi+, ACL2021] • Context limitations make neural language models more human-like [Kuribayashi+, EMNLP2022] • Second language acquisition of neural language models [Oba+, ACL2023 Findings] • Psychometric predictive power of large language models [Kuribayashi+, NAACL2024 Findings] • Emergent word order universals from cognitively-motivated language models [Kuribayashi+, ACL2024] • Automated writing assistance • Developing a human-machine collaborative writing tool [Ito+*, INLG2019][Ito+*, EMNLP2019 demo][Ito+, UIST2023] • Parsing argumentative texts [Kuribayashi+, ACL2019] • Japanese-focused language (model) analysis • Preferences of language models and humans towards word-order flexibility [Kuribayashi+, ACL2020], topicalization [Fujihara+, COLING2022], and ellipsis [Ishiduki+, LREC-COLING2024] • Mechanistic interpretability of Transformer language models (LMs) [Kobayashi+, EMNLP2020][Kobayashi+, EMNLP2021][Kobayashi+, ACL2023 Findings][Kobayashi+, ICLR2024 (spotlight)] Research background 3 One organizer of CMCL (Cognitive Modeling and Computational Linguistics workshop) 2024 @ACL2024: 🚨 Deadline extended to 5/31 See my webpage for more
  4. Outline • Self introduction (5min.) • General introduction and background

    in cognitive modeling (15min.) • Accurate LMs deviate from humans through the lens of cognitive modeling (5min.) • Lower perplexity is not always human-like [Kuribayashi+, ACL2021] • Why? (5min.) • Context limitations make neural language models more human-like [Kuribayashi+, EMNLP2022] • Is this mismatch addressed by recent efforts in human-LLM alignment? No (15 min.) • Psychometric predictive power of large language models [Kuribayashi+, NAACL2024 Findings] • We have relatively human-like LMs (though they are not LLMs); how can one leverage them to answer broader linguistic questions? (15 min.) • Case study: Emergent word order universals from cognitively-motivated language models [Kuribayashi+, ACL2024] • Total ~60 min. 4
  5. • Classical NLP offered explicit implementation and quantitative evaluation for

    linguistic theory • c.f. language analysis based on introspection • What have recent neural (large) language models offered to linguistics? NLP for linguistics in the era of neural LMs 5 NLP Linguistics Task, guideline, evaluation… Tool, resource…
  6. Direction 1: measuring language with NLP • Better explain what

    characteristics language data have • NLP has offered quantitative measurement of language e.g., word similarity, sentence probability • Neural LMs accurately estimate information-theoretic values, e.g., surprisal, entropy • -> revisit information-theoretic science of language with modern LMs! 6 Former part of this talk
  7. Direction 2: simulating the development/emergence of language • Some (causal,

    diachronic, acquisition) questions are not answerable by just measuring texts at hand • Under what conditions does a particular linguistic phenomenon/universal arise? • What makes language acquisition more efficient? • Which aspects of language should be innate? • Yet, human experiments are inherently difficult • Ethical/technical problems for ablation tests [Warstadt&Bowman, 22] 7 Latter part of this talk
  8. Direction 2: simulating the development/emergence of language • Some (causal,

    diachronic, acquisition) questions are not answerable by just measuring texts at hand • Under what conditions does a particular linguistic phenomenon/universal arise? • What makes language acquisition more efficient? • Which aspects of language should be innate? • Yet, human experiments are inherently difficult • Ethical/technical problems for ablation tests [Warstadt&Bowman, 22] • Computational simulation to demonstrate a proof-of-concept for a linguistic hypothesis • Language models are not humans, but they share some properties with us, such as expectation-based sentence processing 8 Latter part of this talk Can complex organic compounds emerge from inorganic precursors? [Miller-Urey experiment] https://en.wikipedia.org/wiki/Miller%E2%80%93Urey_experiment
  9. Background 1/5: human sentence processing • What do humans compute

    during reading, and how (and ultimately, why)? • A function $f_\theta(\cdot)$ is human-like if $f_\theta(\boldsymbol{w})$ simulates reading behavior $\boldsymbol{y}$ well • What is $f$? (what quantity is computed?) • What is $\theta$? (how is it computed?) 10 [Illustration: "If you were to journey to the North of England, …" with per-word cognitive load measured from humans] Tokens: $\boldsymbol{w} = \{w_1, \dots, w_n\}$; cognitive load: $\boldsymbol{y} = \{y_1, \dots, y_n\}$; $\boldsymbol{y} = f_\theta(\boldsymbol{w})$ e.g., longer reading times indicate a higher cognitive load
  10. • Processing cost of a word correlates with its logarithmic

    probability in context • Unpredictable words incur a higher cognitive load • The relationship should be logarithmic, both theoretically [Smith&Levy,13] and empirically [Shain+,22] • If $p(\text{word}) = p_1 \times p_2 \times \cdots \times p_k$ with $k$ going to infinity (super-incrementality), and $\mathrm{cost}(\text{word}) = c(p_1) + c(p_2) + \cdots + c(p_k)$ where $c(x)$ is linear around $x = 1$, then $\mathrm{cost}(\text{word})$ approaches $-\log p(\text{word})$ regardless of the choice of $c$ [Levy,08][Smith&Levy,13][Shain+,22] Background 2/5: Surprisal theory $f_t(\boldsymbol{w}) = -\log_2 p(w_t \mid \boldsymbol{w}_{<t})$ 11 [Examples: "Although my friends left the party I enjoyed …" vs. "Although my friends left the party continues to …" ❗; "My hobby is reading a book, and …" vs. "My hobby is reading a music sheet, and …" ❗] Notation: $V$: vocabulary set; $\boldsymbol{w} \in V^*$: word sequence; $w_t$: $t$-th word in $\boldsymbol{w}$; $\boldsymbol{w}_{<t}$: words before $t$ in $\boldsymbol{w}$
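The step from super-incrementality to surprisal can be spelled out as follows. This is a reconstruction of the Smith & Levy (2013)-style argument the slide sketches, under the stated assumptions; it is not text from the talk.

```latex
% Reconstruction under the assumptions that c is smooth with c(1) = 0 and that each
% sub-prediction probability p_i -> 1 as the number of increments k -> infinity.
\begin{align*}
  \mathrm{cost}(\text{word})
    &= \sum_{i=1}^{k} c(p_i)
     \;\approx\; \sum_{i=1}^{k} \bigl[\, c(1) + c'(1)\,(p_i - 1) \,\bigr]
       && \text{(linearize $c$ around $p_i = 1$)} \\
    &\approx\; c'(1) \sum_{i=1}^{k} \log p_i
       && \text{(since $\log x \approx x - 1$ near $x = 1$)} \\
    &=\; c'(1)\,\log \prod_{i=1}^{k} p_i
     \;=\; c'(1)\,\log p(\text{word})
     \;\propto\; -\log p(\text{word}).
\end{align*}
% The proportionality constant -c'(1) > 0 does not depend on the particular choice of c.
```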
  11. • The next question: how do humans compute surprisal

    $-\log_2 p_\theta(w_t \mid \boldsymbol{w}_{<t})$? (What is $\theta$?) • Approximating surprisal: • N-gram LM: $-\log_2 p_\theta(w_t \mid \boldsymbol{w}_{1:t-1}) = -\log_2 p(w_t \mid \boldsymbol{w}_{t-N+1:t-1}) = -\log_2 \frac{C(\boldsymbol{w}_{t-N+1:t})}{C(\boldsymbol{w}_{t-N+1:t-1})}$ • Neural LM: $-\log_2 p_\theta(w_t \mid \boldsymbol{w}_{1:t-1}) = -\log_2 \mathrm{softmax}(\boldsymbol{W}\boldsymbol{h}_t)[\mathrm{id}(w_t)]$ • Incremental parser: $-\log_2 p_\theta(w_t \mid \boldsymbol{w}_{1:t-1})$ computed as the (negative log) ratio of the total probability mass over derivations compatible with $\boldsymbol{w}_{1:t}$ to that over derivations compatible with $\boldsymbol{w}_{1:t-1}$ [Hale, 16] • … • Humans: $-\log_2 p(w_t \mid \boldsymbol{w}_{1:t-1}) = $ ??? Background 3/5: surprisal is the objective, but how should it be computed…? 12 Notation: $C(\cdot)$: count; $\boldsymbol{W} \in \mathbb{R}^{|V| \times d}$: embedding matrix; $\boldsymbol{h}_t \in \mathbb{R}^d$: hidden state at time $t$; $\mathrm{id}(\cdot)$: returns the word index in the vocabulary
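To make the "Neural LM" case above concrete, here is a minimal sketch (not code from the talk) that computes per-token surprisal, $-\log_2 p_\theta(w_t \mid \boldsymbol{w}_{<t})$, with an off-the-shelf GPT-2 via Hugging Face transformers; the model choice and the example sentence are placeholders.

```python
# A minimal sketch of the "Neural LM" case: per-token surprisal -log2 p(w_t | w_<t)
# from an off-the-shelf GPT-2. Model choice and the example sentence are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str):
    """Return (token, surprisal-in-bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids      # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                            # (1, T, |V|)
    # The logits at position t-1 give the distribution over the token at position t.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)     # (T-1, |V|)
    targets = ids[0, 1:]                                      # (T-1,)
    nats = -log_probs[torch.arange(targets.size(0)), targets]
    bits = nats / torch.log(torch.tensor(2.0))
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits.tolist()))

print(token_surprisals("If you were to journey to the North of England"))
```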
  12. Background 4/5: what models can compute human-like surprisal? • Even

    within neural LMs, there are various $\theta$s: • Linear vs. hierarchical models [Frank&Bod,11]: simple RNN vs. PCFG estimation • Lexicalized vs. unlexicalized models [Fossum&Levy,12]: PoS-based vs. word-based estimation • RNN? LSTM? Transformer? [Aurnhammer&Frank,19][Wilcox+,20][Merkx&Frank,21] • LM quality (perplexity; PPL)? [Frank&Bod,11][Fossum&Levy,12][Goodkind&Bicknell,18] 13 [Plots from Goodkind&Bicknell,18 and Merkx&Frank,21: better PPL tends to go with better PPP] Psychometric predictive power (PPP): the gain in fit of Reading_time(word) ~ surprisal(word) + baseline_factors(word) over Reading_time(word) ~ baseline_factors(word) $\mathrm{PPL} = \prod_{t=1}^{n} p(w_t \mid \boldsymbol{w}_{<t})^{-1/n}$
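The PPP comparison above boils down to fitting two regressions and measuring the log-likelihood gain. The sketch below is illustrative only: the column names (rt, surprisal, length, log_freq) and the toy numbers are made up, and actual studies use eye-tracking or self-paced-reading corpora, typically with (generalized) linear mixed models.

```python
# An illustrative sketch of PPP as the per-word log-likelihood gain from adding
# surprisal to a baseline reading-time regression. Columns and numbers are made up.
import pandas as pd
import statsmodels.formula.api as smf

def ppp(df: pd.DataFrame) -> float:
    """Delta log-likelihood per observation from adding the surprisal predictor."""
    baseline = smf.ols("rt ~ length + log_freq", data=df).fit()
    full = smf.ols("rt ~ surprisal + length + log_freq", data=df).fit()
    return (full.llf - baseline.llf) / len(df)

df = pd.DataFrame({
    "rt":        [210.0, 305.0, 250.0, 400.0, 190.0, 320.0],  # reading time (ms)
    "surprisal": [3.1, 7.8, 5.0, 10.2, 2.4, 8.1],             # -log2 p(word|context)
    "length":    [4, 7, 5, 9, 3, 8],                          # word length
    "log_freq":  [-6.0, -9.5, -7.1, -11.0, -5.2, -9.9],       # log unigram frequency
})
print(ppp(df))
```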
  13. Background 5/5: Linguistically accurate LMs are not always cognitively accurate

    • Large (>GPT-2-small) LMs poorly explain human reading behavior • Language-dependent results (next slide) [Kuribayashi+,21] • Even in English, very large LMs (e.g., OPT, GPT-Neo, GPT-3) are less human-like [Oh&Schuler,23][Shain+,23] 14 💡Scaling does hold cross-lingually, at least when using a small (6-layer) Transformer and varying the training data size [Wilcox+,23] [Plots from Kuribayashi+,21 and Oh&Schuler,23: better PPL (how accurate the LM's prediction is) goes with less human-like surprisal (how well surprisal explains human reading behavior)]
  14. Language-dependent observation [Kuribayashi+, ACL21] • Less uniform information density (UID)

    in Japanese reading times than in English • probably due to word order • There, the human-LM mismatch was emphasized • Accurate LMs tend to yield surprisal with less variation w.r.t. word category • Inconsistent with reading times • Hedge: English and Japanese differ in many ways; there is room to investigate the exact cause of this gap 15 [Figure 4 from the paper: uniformity of gaze duration with respect to segment position in a sentence, computed by a generalized additive model (GD ~ segmentN), for the Dundee Corpus (English) and BCCWJ-EyeTrack (Japanese)] [Embedded first page of the paper: "Lower Perplexity is Not Always Human-Like", Kuribayashi, Oseki, Ito, Yoshida, Asahara, Inui (Tohoku University, Langsmith Inc., University of Tokyo, RIKEN, NINJAL), ACL 2021; abstract and introduction truncated]
  15. Fundamental AI-human alignment problem: why LMs deviate from humans has

    been actively explored: • Superhuman prediction for specific words: named entities [Oh&Schuler,23], low-frequency words [Oh+,24] • Does tokenization matter? [Nair&Resnik,23] • Contamination of reading-time corpora? [Wilcox+,23] • Need for a (slow, syntactic) reanalysis system? [van Schijndel&Linzen,21][Wilcox+,21][Huang+,24] 16 [Illustration: humans vs. LMs] How can one bring LMs closer to humans? The AI-alignment problem
  16. Excessive context access of neural LMs [Kuribayashi+, EMNLP22] • Context

    limitations for neural LMs recover their cognitive plausibility • $-\log p_{\text{neural}}(w_t \mid \mathrm{noise}(\boldsymbol{w}_{1:t-1}))$, e.g., $-\log p_{\text{neural}}(w_t \mid \boldsymbol{w}_{t-k+1:t-1})$ • Fair comparison among LMs with different context access, cf. comparing count-based n-gram and neural LMs 17 [Illustration: "… people wearing a red hat come …" with a gap between humans' working-memory limitation and the Transformer's enhanced context access] [Figure 1 from the paper: PPP vs. input length (e.g., input length of three corresponds to 3-gram surprisal) for different LM architectures; more severe context noise yields better reading-time modeling in both Japanese and English; a simple linear memory decay is also shown] [Embedded first page of the paper: "Context Limitations Make Neural Language Models More Human-Like", Kuribayashi, Oseki, Brassard, Inui (EMNLP 2022); abstract and introduction truncated]
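The context-limitation idea above can be sketched in a few lines: compute surprisal while hiding all but the last k-1 context tokens from the model. This is only an illustration with GPT-2 as a stand-in LM; the paper's actual lossy-context settings and models may differ.

```python
# A minimal sketch of the context-limitation idea, with GPT-2 as a stand-in LM:
# surprisal of w_t computed from only the k-1 preceding tokens, i.e.
# -log p(w_t | w_{t-k+1:t-1}). The paper's actual noise operators may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def limited_context_surprisal(ids: torch.Tensor, t: int, k: int) -> float:
    """Surprisal (bits) of token ids[t] given only the (k-1)-token window before it."""
    start = max(0, t - (k - 1))
    window = ids[start : t + 1].unsqueeze(0)                  # (1, <=k)
    with torch.no_grad():
        logits = model(window).logits                         # (1, L, |V|)
    log_probs = torch.log_softmax(logits[0, -2], dim=-1)      # predicts the last token
    return float(-log_probs[ids[t]] / torch.log(torch.tensor(2.0)))

ids = tokenizer("people wearing a red hat come", return_tensors="pt").input_ids[0]
for k in (2, 3, 5):  # 2-gram-like, 3-gram-like, 5-gram-like context windows
    print(k, limited_context_surprisal(ids, t=len(ids) - 1, k=k))
```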
  17. • Context limitations for neural LMs recover their cognitive plausibility

    • When does (or does not) longer context help? It depends on the syntactic relation type 18 [Illustration as on the previous slide; per-construction differences in $\Delta\mathrm{LogLik}$ between context conditions] [Quoted follow-up work: "Limiting the number of tokens in the context window weakens these associations for predicting rare words, which is most likely the reason why this improves the fit of LM surprisal to reading times, as demonstrated by Kuribayashi et al. (2022)."] [Embedded first page of the paper "Context Limitations Make Neural Language Models More Human-Like" (EMNLP 2022), as on the previous slide]
  18. 19 Psychometric Predictive Power of Large Language Models Tatsuki Kuribayashi1

    Yohei Oseki2, Timothy Baldwin1,3 (1MBZUAI, 2The University of Tokyo, 3The University of Melbourne). To appear in NAACL 2024 Findings. [Embedded first page of the paper; abstract truncated]
  19. This study: cognitive modeling with large language models (LLMs) •

    Several efforts have been made towards AI-human alignment, such as instruction tuning (RLHF) • Q. Did these efforts fill the gap highlighted in the cognitive modeling field? • A. The advancement in LLMs is independent of cognitive modeling • None of the measurements from LLMs outperformed bare word probability from base (non-instruction-tuned) LLMs • Human sentence processing would be tuned to next-word predictability (though not a very accurate one) 20 [Diagram: axis from less accurate to accurate next-word prediction; earlier works [Goodkind&Bicknell,18], recent works [Kuribayashi+,22][Oh&Schuler,23], ours [Kuribayashi+,24]; recent efforts in alignment (e.g., instruction tuning)]
  20. Experiment 1: instruction tuning • Does instruction tuning of LMs

    improve their cognitive plausibility? • Hypothesis for yes • Humans would predict their preferred texts (e.g., with less hallucination) during sentence processing • Hypothesis for no • Fine-tuning may amplify the reporting bias in the training data, while humans are tuned to pure language statistics • The instruction-tuning objective is to create a superhuman chatbot, which is not aligned with the goal of cognitive modeling 21 [Illustration: which surprisal profile, from base LMs or from instruction-tuned LMs, is more similar to humans?]
  21. General experimental setting • Explain reading time with surprisal and

    baseline factors • Evaluation metric • Increase in log-likelihood (psychometric predictive power; PPP) between the regression models with and without the surprisal factor • Higher PPP is better • 2 corpora • Natural Stories and the Dundee corpus • 3 measurements • Surprisal ($h$), Shannon entropy ($H$), and Rényi entropy with $\alpha = 0.5$ ($H_{0.5}$) • 26 models • GPT-2 (177M-1.5B), OPT (125M-66B), GPT-3 (babbage-002, davinci-002), GPT-3.5 (text-davinci-002/003), Llama-2 (7B-70B), Llama-2-instruct (7B-70B), Falcon (7B, 40B), Falcon-instruct (7B, 40B) 22 Reading_time(word) ~ surprisal(word) + baseline_factors(word) Baseline factors: unigram prob. and length of the t, t-1, and t-2 tokens
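The three measurements listed above can be written down directly from the next-token distribution. The snippet below is an illustrative sketch of the textbook definitions, not the paper's implementation, using a toy distribution.

```python
# An illustrative sketch of the three word-level metrics named above, computed from
# a next-token distribution p over the vocabulary (a tensor that sums to 1).
import torch

def surprisal(p: torch.Tensor, target_id: int) -> float:
    """h = -log2 p(w_t | context) for the token that was actually read."""
    return float(-torch.log2(p[target_id]))

def shannon_entropy(p: torch.Tensor) -> float:
    """H = -sum_v p(v) log2 p(v) over the whole vocabulary."""
    return float(-(p * torch.log2(p.clamp_min(1e-12))).sum())

def renyi_entropy(p: torch.Tensor, alpha: float = 0.5) -> float:
    """H_alpha = log2(sum_v p(v)^alpha) / (1 - alpha); alpha = 0.5 in the talk."""
    return float(torch.log2((p ** alpha).sum()) / (1.0 - alpha))

p = torch.tensor([0.7, 0.2, 0.05, 0.05])   # toy distribution over a 4-word vocabulary
print(surprisal(p, target_id=1), shannon_entropy(p), renyi_entropy(p, 0.5))
```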
  22. Result of experiment 1: instruction tuning 23

    • No evidence for positive effects of instruction tuning on cognitive modeling • Instruction tuning frequently hurt PPP • No specific LLM family had a high PPP [Figure: PPP of the tested models, and the improvement/degradation of PPP due to instruction tuning] [Embedded from the paper: experimental settings (26 LLMs as candidate models θ: four GPT-2, four GPT-3/3.5, six LLaMA-2, four Falcon, and eight OPT models with different sizes and instruction-tuning settings; entropy metrics are omitted for GPT-3/3.5, whose APIs do not expose the full-vocabulary distribution; data: the Dundee Corpus (DC) and the Natural Stories Corpus (NS)) and Table 1: the PPL and PPP scores (h, H, H0.5) of the tested LMs; the "IT" column denotes whether instruction tuning is applied]
  23. Result of experiment 1: instruction tuning • Instruction-tuned models cannot

    balance PPL and PPP • Instruction-tuned LLMs always have worse PPP than base LMs with equivalent PPL 24 [Figure: PPL vs. PPP on the Dundee and Natural Stories corpora; legend: model family (GPT-2, LLaMA-2, Falcon, GPT-3/3.5, OPT), instruction tuning (tuned (IT) vs. not tuned (base)), and model size (smaller to larger); instruction-tuned models are worse than base LMs]
  24. Experiment 2: prompting • What kind of prompts make LLMs

    more human-like? • General interest: how is human sentence processing linguistically biased? 25 $\mathrm{RT} \propto -\log p(\text{word} \mid \text{context}, \text{human\_bias})$ vs. $\mathrm{RT} \propto -\log p(\text{word} \mid \text{context}, \text{prompt})$ (prompt-conditioned surprisal; e.g., "Generate a grammatically simple sentence")
  25. Experiment 2: prompting 26 Please complete the following sentence to

    make it as grammatically simple as possible: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence with a careful focus on grammar: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence to make it as grammatically complex as possible: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence using the simplest vocabulary possible: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence with a careful focus on word choice: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence using the most difficult vocabulary possible: $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence in a human-like manner. It has been reported that human ability to predict next words is weaker than language models and that humans often make noisy predictions, such as careless grammatical errors. $w_0, w_1, \dots, w_{t-1}$ Please complete the following sentence. We are trying to reproduce human reading times with the word prediction probabilities you calculate, so please predict the next word like a human. It has been reported that human ability to predict next words is weaker than language models and that humans often make noisy predictions, such as careless grammatical errors. $w_0, w_1, \dots, w_{t-1}$ $\mathrm{RT}(w_t) \propto -\log p(w_t \mid w_0 \dots w_{t-1}, \text{prompt})$ Prompt categories: syntax, vocabulary, task-oriented
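A minimal sketch of prompt-conditioned surprisal as defined above: the instruction sits in the context, but only the sentence tokens are scored. GPT-2 stands in for the instruction-tuned LLMs used in the paper, and the example prompt is taken from the list above.

```python
# A minimal sketch of prompt-conditioned surprisal, -log p(w_t | prompt, w_0..w_{t-1}).
# GPT-2 is only a stand-in for an instruction-tuned LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompted_surprisals(prompt: str, sentence: str):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    sent_ids = tokenizer(" " + sentence, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, sent_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nats = -log_probs[torch.arange(targets.size(0)), targets]
    start = prompt_ids.size(1) - 1     # first position that predicts a sentence token
    bits = (nats[start:] / torch.log(torch.tensor(2.0))).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(sent_ids[0].tolist()), bits))

prompt = "Please complete the following sentence with a careful focus on grammar:\n"
print(prompted_surprisals(prompt, "The cat that the dog chased ran away."))
```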
  26. Preliminary analysis: prompting • Does prompting properly bias the next-word

    prediction? Yes 27 [Figures: dependency length, sentence length, and word frequency of the prompted generations]
  27. Results of experiment 2: prompting • Prompts mentioning grammar and/or

    simplicity improved PPP • Connection with syntactic bias and good-enough processing in humans 28 [Table 2 from the paper: PPP scores (h, H, H0.5 on DC and NS) for each of the ten prompt conditions (the syntax-, vocabulary-, and task-oriented prompts listed on the previous slide, a generic completion prompt, and a no-prompt baseline), averaged across the seven IT-LLMs; the annotated best-scoring prompts relate to simplicity, grammar, and vocabulary 👍]
  28. Results of experiment 2: prompting • Prompt-conditioned surprisal (red-lined markers) cannot outperform base LLMs with a similar PPL 29

    [Figure: PPL vs. PPP on the Dundee and Natural Stories corpora; legend: model family (GPT-2, LLaMA-2, Falcon, GPT-3/3.5, OPT), instruction tuning (tuned (IT) vs. not tuned (base)), model size (smaller to larger), and tuned&prompted; prompted models remain worse than base LMs]
  29. • Directly asking cognitive loads of words via meta-linguistic prompting

    • "Hey LLM, tell me the reading time/surprisal of this word in this sentence" • The task is simplified into a token-sorting problem w.r.t. processing cost • 3-shot setting Experiment 3: meta-linguistic prompting 30 [Hu&Levy,23] Example (one example shown): Suppose humans read the following sentence: "'No, it's fine. I love it,' said Lucy knowing that affording the phone had been no small thing for her mother." List the tokens and their IDs in order of their reading cost (high to low) during sentence processing. Token ID: 0: 'No,, 1: it's, 2: fine., 3: I, 4: love, 5: it,', 6: said, 7: Lucy, 8: knowing, 9: that, 10: affording, 11: the, 12: phone, 13: had, 14: been, 15: no, 16: small, 17: thing, 18: for, 19: her, 20: mother., Answer: 20: mother., 10: affording, 6: said, 11: the, 0: 'No,, 7: Lucy, 1: it's, 9: that, 17: thing, 5: it,', 2: fine., 15: no, 14: been, 3: I, 13: had, 8: knowing, 12: phone, 19: her, 16: small, 4: love, 18: for, Suppose humans read the following sentence: "A clear and joyous day it was and out on the wide open sea, thousands upon thousands of sparkling water drops, excited by getting to play in the ocean, danced all around." List the tokens and their IDs in order of their reading cost (high to low) during sentence processing. Token ID: …
  30. Results of experiment 3: meta-linguistic prompting • No correlation between

    model's prediction and reading time • Spearman's ρ is reported • Do models simply struggle with sorting a large number of items? • No, even the order of the first three tokens listed by the models has no correlation with reading time/surprisal 31 [Table 3 from the paper: rank correlations between estimated cognitive load and word reading times on DC and NS for two simplified prompts ("Suppose humans read the following sentence: [SENT]. List the tokens in order of their reading cost (high to low) during sentence processing." and "Suppose you read the following sentence: [SENT]. List the tokens in order of their probability in context (low to high).") versus surprisal-based estimation; prompting-based correlations are near zero (roughly -0.04 to 0.13) for LLaMA-2, Falcon, and GPT-3.5, while surprisal-based estimation reaches ρ ≈ 0.16-0.32 😨]
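The evaluation above is a rank correlation between the LLM-produced ordering and observed reading times. The sketch below uses made-up numbers purely to illustrate the computation.

```python
# An illustrative sketch, with made-up numbers: Spearman's rho between an LLM-produced
# ranking of tokens by "reading cost" and observed reading times.
from scipy.stats import spearmanr

tokens = ["mother.", "affording", "said", "the", "phone"]
llm_rank = {"mother.": 1, "affording": 2, "said": 3, "the": 4, "phone": 5}  # 1 = highest predicted cost (hypothetical)
reading_time = {"mother.": 310, "affording": 290, "said": 180, "the": 150, "phone": 240}  # gaze duration in ms (hypothetical)

# Higher predicted cost should pair with longer reading time, so correlate the
# negated rank against the reading times.
rho, p = spearmanr([-llm_rank[t] for t in tokens], [reading_time[t] for t in tokens])
print(rho, p)
```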
  31. • Incidental finding: weak correlation between the actual surprisals and

    those predicted by prompting • Lack of LLMs' meta-cognition of their own surprisal Results of experiment 3: meta-linguistic prompting 32 [Table 4 from the paper: rank correlations between the word probability (rank) estimated by the prompt and the actual surprisal values computed by the corresponding model; correlations are weak (ρ ≈ 0.02-0.30) for LLaMA-2, Falcon, and GPT-3.5 on DC and NS] [Cropped column of the paper's related-work discussion: prompting LLMs for linguistic judgments (Hu & Levy, 2023; Dentella et al., 2023), the competence-performance distinction (Chomsky, 1965), calibration of model outputs (Kadavath et al., 2022), and instruction tuning (Wei et al.; Ouyang et al., 2022; Askell et al., 2021); text truncated]
  32. Summary • Current advancement in LLMs does not offer a

    better measurement for cognitive modeling than simple bare word probability • At least within the models, corpora, and metrics we used • Bare probabilities estimated by base LMs are a strong predictor of human reading behavior even in the age of LLMs • Nevertheless, small LMs are intuitively non-human-like given their poor ability… how should this gap be filled? [TODO] 33 [Diagram: axis from less accurate to accurate next-word prediction; earlier works [Goodkind&Bicknell,18], recent works [Kuribayashi+,22][Oh&Schuler,23], ours [Kuribayashi+,24]; recent efforts in alignment (e.g., instruction tuning)]
  33. Information-theoretic linguistics and cognitive modeling • Information-theoretic science of language

    • Linguistic typology: how well does LM-estimated predictability explain typological universals? • Emergent language/communication: under what conditions does natural language emerge from neural agents? 35 Hsd ad ud asjkdaj 💡
  34. Information-theoretic linguistics and cognitive modeling • Information-theoretic science of language

    • Linguistic typology: how well does LM-estimated predictability explain typological universals? • Emergent language/communication: under what conditions does natural language emerge from neural agents? • Cognitive modeling offers more human-like information-theoretic measurements • Do human-like measurements update information-theoretic language science? 36 Hsd ad ud asjkdaj 💡 revisit
  35. Information-theoretic linguistics and cognitive modeling • Information-theoretic science of language

    • Linguistic typology: how well does LM-estimated predictability explain typological universals? • Emergent language/communication: under what conditions does natural language emerge from neural agents? • Cognitive modeling offers more human-like information-theoretic measurements • Do human-like measurements update information-theoretic language science? 37 Hsd ad ud asjkdaj 💡 revisit [Diagram: ① human-like measurement (check if a measurement simulates human-like cognitive load) → ② correlation with a simulation of a phenomenon Y, e.g., language universals → ③ Y is also related to cognitive bias, e.g., check if phenomenon Y emerges only under the human-like measurement]
  36. 39 Emergent Word Order Universals from Cognitively-Motivated Language Models Tatsuki

    Kuribayashi, Ryo Ueda, Ryo Yoshida, Yohei Oseki, Ted Briscoe, Timothy Baldwin (Mohamed bin Zayed University of Artificial Intelligence, The University of Tokyo, The University of Melbourne). To appear in ACL 2024. [Embedded first page of the paper; abstract truncated: "The world's languages exhibit certain so-called typological or implicational universals; for example, Subject-Object-Verb (SOV) word order …"]
  37. Word order universals • Some word orders are frequent within

    attested languages • Are these variations/universals due to cognitive biases? If not, are they coincidental? 40 [Example sentences with word-by-word glosses: Japanese "chairoi nekoga chiːsana nezumio subayaku oikaketa" (brown cat-NOM small mouse-ACC fast chased) and Arabic "lahiqa al-qittu al-bunnyyu bil-fa'ri al-saghiri sariiʕan" (chased cat-NOM brown mouse-ACC small fast), both roughly 'the brown cat quickly chased the small mouse']
  38. Computational simulation for word order universals • What we want

    to demonstrate: • Word order universals emerge through communication between agents with cognitive biases • Word order universals do not emerge through communication between agents without cognitive biases • Such an ablation of language emergence in the real world is impossible 41 Hsd ad ud asjkdaj 💡 Ad asjkdaj ud hsd 💡
  39. • What we want to demonstrate: • Simulation with cognitively-plausible

    and cognitively-implausible agents Computational simulation for word order universals 42 Word order universals emerge through communication between agents with cognitive biases Word order universals do not emerge through communication between agents without cognitive biases Hsd ad ud asjkdaj 💡 Ad asjkdaj ud hsd 💡
  40. Emergent word order universals as ranking linearization strategies 43 Hsd

    ad ud asjkdaj What comes next… Hsd ad ud asjkdaj Processing cost Suppose we have several linearization strategies (word order configurations) $\mathcal{O}$; $o_i$: linearization strategy
  41. Emergent word order universals as ranking linearization strategies ① How

    high is the processing cost of word order $o \in \mathcal{O}$? • Train an LM $\theta$ on each $o$ and measure perplexity/entropy • The listener does next-word prediction (expectation-based reading) • Processing cost: average predictability (perplexity/entropy) • $\boldsymbol{p}_\theta = [\mathrm{PPL}_\theta(o_1), \mathrm{PPL}_\theta(o_2), \dots, \mathrm{PPL}_\theta(o_N)]$ 44 [Diagram: artificial corpora under different word orders (LLLLLL, LRLRLR, …, RRRRRR; e.g., "tifer ob ciateda rel bullifs sub remiff"), language modeling on each configuration (step ①), word-order frequency across attested languages such as Türkçe, 日本語, عربي (step ②), and their correlation under a cognitive inductive bias (step ③)]
  42. Emergent word order universals as ranking linearization strategies ① How

    high is the processing cost of word order $o \in \mathcal{O}$? • Train an LM $\theta$ on each $o$ and measure perplexity/entropy • The listener does next-word prediction (expectation-based reading) • Processing cost: average predictability (perplexity/entropy) • $\boldsymbol{p}_\theta = [\mathrm{PPL}_\theta(o_1), \mathrm{PPL}_\theta(o_2), \dots, \mathrm{PPL}_\theta(o_N)]$ ② How frequently is word order $o$ adopted in the world? • World Atlas of Language Structures (WALS) database • $\boldsymbol{f} = [\mathrm{Freq}(o_1), \mathrm{Freq}(o_2), \dots, \mathrm{Freq}(o_N)]$ 45 [Diagram as on the previous slide]
  43. Emergent word order universals as ranking linearization strategies ① How

    high is the processing cost of word order $o \in \mathcal{O}$? • Train an LM $\theta$ on each $o$ and measure perplexity/entropy • The listener does next-word prediction (expectation-based reading) • Processing cost: average predictability (perplexity/entropy) • $\boldsymbol{p}_\theta = [\mathrm{PPL}_\theta(o_1), \mathrm{PPL}_\theta(o_2), \dots, \mathrm{PPL}_\theta(o_N)]$ ② How frequently is word order $o$ adopted in the world? • World Atlas of Language Structures (WALS) database • $\boldsymbol{f} = [\mathrm{Freq}(o_1), \mathrm{Freq}(o_2), \dots, \mathrm{Freq}(o_N)]$ ③ Comparing these distributions: $\mathrm{correl}(\boldsymbol{f}, -\boldsymbol{p}_\theta)$ • Linking hypothesis: word orders with a low processing cost have survived and are thus frequently used • Can human-like LMs yield this correlation? • If so, word order would have "evolved" to incur a low processing cost for humans 46 [Diagram as on slide 41]
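Step ③ above is a simple correlation between two vectors indexed by word-order configuration. The sketch below is purely illustrative: the order names, perplexities, and frequencies are made up, not WALS counts or the paper's values.

```python
# An illustrative sketch of step ③: correlate the typological frequency of each
# word-order configuration with the negated per-order perplexity of one trained LM.
from scipy.stats import spearmanr

ppl = {"LLLLLL": 102.0, "LRRLRL": 95.0, "RRRRRR": 110.0, "RLLLRR": 140.0}   # PPL_theta(o), toy values
freq = {"LLLLLL": 0.41, "LRRLRL": 0.35, "RRRRRR": 0.09, "RLLLRR": 0.01}     # Freq(o), toy values

orders = sorted(ppl)
rho, p = spearmanr([-ppl[o] for o in orders], [freq[o] for o in orders])
print(f"correl(f, -p_theta) = {rho:.2f} (p = {p:.2f})")
# A human-like LM is expected to assign lower PPL (higher predictability) to the
# word orders that are frequent among the world's languages.
```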
  44. Emergent word order universals as ranking linearization strategies • Data:

    • A set of PCFG-generated artificial-language corpora $\mathcal{L}$ [White&Cotterell 2021] with word-order parameters • Each language corpus $l \in \mathcal{L}$ has a different word-order configuration $o \in \{\mathrm{L}, \mathrm{R}\}^6$ • $o \in \{\mathrm{LLLLLL}, \mathrm{LLLLLR}, \dots, \mathrm{RRRRRR}\} = \mathcal{O}$ • LLLLLL: fully left-branching language • LRRLRL: English-like word order • RRRRRR: fully right-branching language • $|\mathcal{L}| = |\mathcal{O}| = 64\,(= 2^6)$ • PPL differences can only stem from the model's inductive biases • All the corpora have the same generation probability w.r.t. the respective grammar 47 [Excerpt from the paper: the processing effort required to process sentences with word order $o$ is quantified by perplexity, $\mathrm{Effort}(o) \sim \prod_{w_i \in L_o} p_\theta(w_i \mid w_{<i})^{-\frac{1}{|L_o|}}$, computed by an LM $\theta$; more generally, human language arguably balances complexity and informativity (Ferrer i Cancho & Solé 2003; Piantadosi et al. 2012; Kemp & Regier 2012; Frank & Goodman 2012; Kirby et al. 2015; Kanwal et al. 2017; Gibson et al. 2019; Xu et al. 2020)] [Table 1 from the paper: word-order parameters and example constructions with the L vs. R assignment: sS "Cat eats." / "Eats cat."; sVP "Cat mouse eats." / "Cat eats mouse."; sPP "Cat table on eats." / "Cat on table eats."; sNP "Small cat eats." / "Cat small eats."; sRel "Likes milk that cat eats." / "Cat that likes milk eats."; sCase "Cat-sub eats." / "Sub-cat eats." (see White and Cotterell (2021) for details)] [Diagram as on slide 41]
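The configuration space described above can be enumerated directly. This is an illustrative sketch, with the parameter names taken from Table 1 and the base-order grouping (LLXXXX = SOV, LRXXXX = SVO, RLXXXX = OVS, RRXXXX = VOS) taken from a later slide.

```python
# An illustrative sketch of the word-order configuration space: six binary
# head-direction parameters give 2^6 = 64 configurations; the first two parameters
# define the base-order group (as noted on a later slide).
from itertools import product

PARAMS = ["sS", "sVP", "sPP", "sNP", "sRel", "sCase"]
BASE_GROUP = {"LL": "SOV", "LR": "SVO", "RL": "OVS", "RR": "VOS"}

configs = ["".join(bits) for bits in product("LR", repeat=len(PARAMS))]
assert len(configs) == 64

for o in ["LLLLLL", "LRRLRL", "RRRRRR"]:
    print(o, dict(zip(PARAMS, o)), "base group:", BASE_GROUP[o[:2]])
```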
  45. • 23 LMs (on 64 languages): • Human-likeness: • syntactic

    bias [Hale+,17] • left-corner traversal [Resnik,1992][Yoshida+,21] • memory limitation [Futrell+,20][Kuribayashi+,22] • PLM (parsing-as-LM) predicts parsing-action sequences (e.g., NT(NP) GEN(The) … REDUCE) • RNNG (recurrent neural network grammar) has hierarchical, compositional operations • SRNNG is an RNNG with a simple RNN (without the bi-directional LSTM) Experimental setting 48 [Table: the 23 LM configurations and their properties (syntactic bias, top-down vs. left-corner* traversal, memory limitation): Transformer, LSTM, SRN, {5,4,3}-gram; Transformer/LSTM/SRN/{5,4,3}-gram PLMs with top-down or left-corner* traversal; RNNG and SRNNG with top-down or left-corner* traversal; and LLaMA-2. *Arc-standard traversal is used due to an implementation issue]
  46. Results • 5 seeds for each model • Ideal: •<▪<▲

    • Regression analysis shows all the factors are effective • Syntactic bias • Left-corner traversal • Context limitation (but only for RNNGs) 49 [Figure 3 from the paper: global/local correlation results for w/o-syntax, top-down (TD) syntactic, and left-corner (LC) syntactic LMs, each with and without memory limitation; each point corresponds to one run, colors/shapes denote the syntactic bias of the models, the TD and LC variants of Transformer, LSTM, SRN, and N-gram are the respective PLMs, and boxes show the lower/upper quartiles] [Excerpt from the paper: for syntactic LMs the processing cost is an action-level perplexity $\mathrm{PPL}_o(x, y) := \prod_t p(a_t \mid a_{<t})^{-\frac{1}{|\boldsymbol{a}|}}$; a token-level predictability $\mathrm{PPL}_o(x)$ is also examined in §7.1] Exceptional case: a relaxed version of $\mathrm{Correl}(\boldsymbol{f}, -\boldsymbol{p})$ (intra-group correlations within each base-order group: LLXXXX (SOV), LRXXXX (SVO), RLXXXX (OVS), RRXXXX (VOS))
  47. Other findings • 😨 Predictability explains many word order universals but cannot explain the subject-first bias • Larger circles (more frequent orders) should be positioned higher ↑ (more predictable) • Some infrequent orders have high predictability (OVS > SVO) • Humans have a subject-first bias while LMs do not, or this reflects limitations of the artificial languages • 😃 Syntactically-biased predictability (PPL) entails parsability • Predictability + Parsability + MemoryLimitation + LeftCorner + … = surprisal from a human-like LM 50

    [Figure 6 from the paper: predictability and parsability of each word order, min-max normalized to the [0, 1] scale (higher is better); each circle is a word order, and larger circles are more frequent orders] [Figure 7: predictability (y-axis; negative PPL after min-max normalization) vs. word-order frequency within each of the four base-order groups (SOV, SVO, OVS, VOS); results from the 3-gram PLM with the LC strategy; within each group, common word orders tend to obtain high predictability (lower PPL), except for the OVS orders' high predictability and the SVO orders' low predictability, so predictability generally explains word-order universals but not the subject-first preference] [Excerpt: two regression models are compared, Base: $\mathrm{Freq}(o) \sim \mathrm{PPL}_o(x)$ and +Parse: $\mathrm{Freq}(o) \sim \mathrm{PPL}_o(x) + \mathrm{Parse}_o(x, y)$; the increase in log-likelihood of the +Parse model over the Base model is not significant under the likelihood-ratio test (p > 0.1)] [Diagram labels: Predictability, Parsability [Hahn+,20], algorithmic level ("how" question), computational level]
  48. Related research • LMs cannot distinguish possible languages from

    the impossible, thus they say nothing about language [Chomsky+,23] (counter-claim [Kallini+,24]) • Which LMs can distinguish typical word orders from untypical ones? • A relatively more "natural" but controlled testbed to investigate LMs/theory • Formal languages such as $a^n b^n c^n$ are typically used to test the inductive bias of LMs • Which languages are hard to language-model? • Our direction is somewhat the opposite of that equity-oriented one, though 51 More nuanced, close boundary between the possible and impossible [Kallini+,24]
  49. RE: Information-theoretic linguistics and cognitive modeling • Information-theoretic science of

    language • Linguistic typology: how well does LM-estimated predictability explain typological universals? • Emergent language/communication: under what conditions does natural language emerge from neural agents? • Cognitive modeling: particular models can compute more human-like information-theoretic measurements • Will human-like measurements update the results of information-theoretic language science? I.e., is the phenomenon of interest related to cognitive biases? 52 Hsd ad ud asjkdaj 💡 revisit Human-like LMs better led to the emergence of word order universals; i.e., attested word orders are biased toward predictability under cognitively plausible biases.
  50. Summary: from cognitive modeling to linguistic typology • Accurate LMs

    deviate from humans through the lens of cognitive modeling. • Lower perplexity is not always human-like [Kuribayashi+, ACL2021] • Why? • Context limitations make neural language models more human-like [Kuribayashi+, EMNLP2022] • Is this mismatch addressed by recent efforts in human-LLM alignment? No • Psychometric predictive power of large language models [Kuribayashi+, NAACL2024 Findings] • We have relatively human-like LMs (though they are not LLMs). Can these be leveraged to answer broader questions in linguistics? • Case study: Emergent word order universals from cognitively-motivated language models [Kuribayashi+, ACL2024] 53