autoregressive model using GPT-NeoX ◦ Trained w/ 750B tokens (Japanese/English) ▪ Japanese/English Wikipedia ▪ Japanese CC-100, mC4 ▪ extended Japanese OSCAR • Japanese-StableLM-Instruct-Alpha-7B ◦ SFT the above base model with Japanese instruction ▪ Stanford Alpaca ▪ Dolly-15k ▪ Japanese translation of Anthropic HH ◦ Trained with 3 epoch
autoregressive model using GPT-NeoX ◦ Trained w/ 750B tokens (Japanese/English) ▪ Japanese/English Wikipedia ▪ Japanese CC-100, mC4 ▪ extended Japanese OSCAR • Japanese-StableLM-Instruct-Alpha-7B ◦ SFT the above base model with Japanese instruction ▪ Stanford Alpaca ▪ Dolly-15k ▪ Japanese translation of Anthropic HH ◦ Trained with 3 epoch
• Working on our own fork of EleutherAI/gpt-neox • Gated MLP (MLP with Gated Linear Unit) ◦ Becoming common op since SwiGLU and the other variants have been shown to improve perf. ◦ from BTLM-3B-8K: “We also found … SwiGLU further improved training efficiency.” • xPos (Extrapolatable Position Embedding) ◦ Our JSLM has seqlen=2048 ◦ xPos is a upgraded version of RoPE from Roformer ◦ more stable for long-term context modeling • Advice: consider the cost to introduce new model arch and make sure to properly implement your own NeoX->HF conversion to be user friendly 🤗
kinds of CL. The core idea of CL is to presents easier/simpler examples earlier during training and gradually increases the sample difficulties. • Aim to improve NN’s training stability or training efficiency, but usually not both. • We used Sequence Length Warmup (SLW) which define the difficulty as seqlen. In our 1B runs, SLW improved both training stability and efficiency which is great! 😎 • Start w/ SL=64, use 20k steps to reach SL=2048 (make sure to count #tokens right!) Li+, The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models, NeurIPS’22 https://arxiv.org/abs/2108.06084
and English corpus during pretraining (check list of corpus in our model page) and simply set the language ratio as 50:50. • For different domains (web/wiki/books etc), we eventually set the domain weight based on their quantity (#tokens) so large portion of final corpus is web-based text (~85%). • We tried DoReMi to find the optimal domain weights w/ our own implementation but had no success. Recently, the paper author released their official code so we might revisit this later. Xie+,DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, 2023. https://arxiv.org/abs/2108.06084
sleep well at night • Use BF16 for mixed precision. FP16 might work if seqlen is short (e.g, 256-512), but OPT-175B and Bloom showed us for longer seqlen probably better to use BF16. • If loss spikes happen still, try setting gradient clipping to 1 and use lower LR (for 7B model, 1.2e-4 should work) • Try curriculum learning like SWL, or skip some batches right before spikes to bypass the spike.
autoregressive model using GPT-NeoX ◦ Trained w/ 750B tokens (Japanese/English) ▪ Japanese/English Wikipedia ▪ Japanese CC-100, mC4 ▪ extended Japanese OSCAR • Japanese-StableLM-Instruct-Alpha-7B ◦ SFT the above base model with Japanese instruction ▪ Stanford Alpaca ▪ Dolly-15k ▪ Japanese translation of Anthropic HH ◦ Trained with 3 epoch
autoregressive model using GPT-NeoX ◦ Trained w/ 750B tokens (Japanese/English) ▪ Japanese/English Wikipedia ▪ Japanese CC-100, mC4 ▪ extended Japanese OSCAR • Japanese-StableLM-Instruct-Alpha-7B ◦ SFT the above base model with Japanese instruction ▪ Stanford Alpaca ▪ Dolly-15k ▪ Japanese translation of Anthropic HH ◦ Trained with 3 epoch
started to build Japanese LLM months ago, we were surprised that there is not easy way to evaluate these models’ Japanese ability. • We extended lm-evaluation-harness with the community to allow everyone to evaluate Japanese LLMs easily. • Useful for evaluating general Japanese NLU ability but hard to evaluate NLG ability esp. for autoregressive models like GPT.
by LLM-as-a-judge approach, we also extended their work for Japanese LLMs evaluation recently. • Japanese MT-Bench includes: ◦ 160 2-turn Japan/Japanese-relevant questions across 8 domains (writing, roleplay, math etc) ◦ auto-graded by GPT-4, serve as a proxy measure to enable fast iteration ◦ used with lm-evaluation-harness together to get a better understanding of how these Japanese LLMs perform. • Feel free to send PRs and collaborate!
Cross-Lingual Transfer 英語で学習したモデルに、日本語で追加の学習を行う 例:elyza/ELYZA-japanese-Llama-2-7b, BLOOM+1, x-LLAMA Yong+, BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting, ACL 2023. https://arxiv.org/abs/2212.09535 Zhu+, Extrapolating Large Language Models to Non-English by Aligning Language, 2023. https://arxiv.org/abs/2308.04948 Japanese StableLM StableLM
• できるモデルに定性的な違いはあるのか? Curriculum Learning 2つのアプローチを組み合わせる。例えば、少しずつ日本語の割合を増やす。 例:PolyLM Wei+, PolyLM: An Open Source Polyglot Large Language Model, 2023. https://arxiv.org/abs/2307.06018 🤔
Learningの効率が悪くなる 👎 • 他の影響は? 🤔 Artexe+, On the Cross-lingual Transferability of Monolingual Representations. ACL 2020. https://arxiv.org/abs/1910.11856 Vries+, As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages, ACL 2021. https://arxiv.org/abs/2012.05628 Minixhofer+, WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models, NAACL 2022. https://arxiv.org/abs/2112.06598 Ostendorff+, Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning, 2023. https://arxiv.org/abs/2301.09626