Experiments on transfer learning of English-trained language models to understand and generate Taiwanese Mandarin (Traditional Chinese)
Pokai Chang, 2023/06
Why train a new language model?
Most open source models can't generate Traditional Chinese well: they often reply in Simplified Chinese even when prompted in Traditional Chinese.
[Screenshots from https://chat.lmsys.org/, each showing a prompt in Traditional Chinese answered by a reply in Simplified Chinese.]
Why train a new language model?
• It makes sense: Simplified Chinese data is more than 2× larger than Traditional Chinese data in most of the corpora used to train those language models.
• For example, in the BigScience ROOTS Corpus (used to train the BLOOM language model), Simplified Chinese is 342× larger than Traditional Chinese, measured in bytes (arXiv:2303.03915).
[Chart: relative sizes of Simplified Chinese vs. Traditional Chinese data in the ROOTS corpus.]
Why train a new language model?
Can we get an English-trained language model to learn Traditional Chinese? LLMs are known for their ability to do transfer learning.
• We can take advantage of the emerging English language models and turn them into Traditional Chinese language models! 🦙🦙🦙
[Diagram: LLaMA, Pythia, MPT, … → reproducible process → zh-tw LLaMA, zh-tw Pythia, zh-tw MPT, …]
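The slides don't spell out the reproducible process here, but a minimal sketch of a plausible first step, assuming it begins by extending the tokenizer with Traditional Chinese tokens and resizing the embedding matrix (the checkpoint name is the public EleutherAI release; the token list is illustrative):

```python
# Sketch: extend an English model's tokenizer with Traditional Chinese
# tokens and resize its embedding matrix. The token list is illustrative;
# a real run would mine thousands of tokens from a zh-tw corpus.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

new_tokens = ["臺灣", "語言", "模型"]  # illustrative Traditional Chinese tokens
added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized
print(f"added {added} tokens; vocabulary size is now {len(tokenizer)}")
```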
Demo: a model based on pythia-6.9b
In the video, the model claims that it was trained by OpenAI. That is not true: the model shown was trained based on pythia-6.9b. It says so because it was trained with ShareGPT data, which includes conversations in which the AI introduces itself as ChatGPT, trained by OpenAI.
Why Based on Pythia?
• Pythia is trained mostly on English data, which minimizes the impact of Simplified Chinese and makes the learning results stand out.
• Pythia has various versions with different sizes (70M ~ 12B).
• We can do experiments on small models and scale up to larger ones, as sketched below.
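A minimal sketch of how such size-laddered experiments could be set up. The checkpoint names are the public EleutherAI Pythia releases; the loop structure and placeholder experiment body are assumptions of mine, not the slides' actual pipeline:

```python
# Sketch: run the same experiment across Pythia checkpoint sizes,
# smallest first, before committing to the larger ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

PYTHIA_SIZES = ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b"]

for size in PYTHIA_SIZES[:3]:  # start small; widen the slice to scale up
    name = f"EleutherAI/pythia-{size}"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params:,} parameters")
    # ... run the Traditional Chinese transfer-learning experiment here ...
```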
The Training Process
• When we learn a new language, we do not need to re-learn basic logic and reasoning abilities.
• It might be the same for language models: we might not need to train all the parameters (see the sketch below).
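A minimal sketch of what "not training all the parameters" could look like, assuming a GPT-NeoX-style model such as Pythia. Unfreezing only the token embeddings and output head is an illustrative choice on my part, not necessarily the configuration these experiments used:

```python
# Sketch: freeze everything, then unfreeze only the token embeddings and
# the output head, so new-language training touches few parameters.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

for param in model.parameters():
    param.requires_grad = False  # keep the "reasoning" weights fixed

# GPTNeoXForCausalLM layout: input embeddings at gpt_neox.embed_in,
# the LM head at embed_out.
for param in model.gpt_neox.embed_in.parameters():
    param.requires_grad = True
for param in model.embed_out.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```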
• Check whether the model is learning by monitoring the embedding of each new token (a sketch follows this list).
• Train a complete model and see how it performs.
• Apply the training to more open source models.
• Take the trained models and have some fun!
• Train a model on the content of politicians answering questions……
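A minimal sketch of how that embedding monitoring could work, assuming the tokenizer has already been extended with new Traditional Chinese tokens; the token "臺灣" and the checkpoint name are illustrative:

```python
# Sketch: gauge whether a newly added token is being learned by checking
# which existing tokens its embedding is closest to (cosine similarity).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def nearest_tokens(token: str, k: int = 5) -> list[str]:
    emb = model.get_input_embeddings().weight        # (vocab_size, hidden_dim)
    tid = tokenizer.convert_tokens_to_ids(token)
    sims = torch.nn.functional.cosine_similarity(emb[tid].unsqueeze(0), emb)
    top = sims.topk(k + 1).indices.tolist()          # k+1 so we can drop the token itself
    return [tokenizer.convert_ids_to_tokens(i) for i in top if i != tid][:k]

# Call every few training steps for each newly added token, e.g.:
# print(nearest_tokens("臺灣"))
```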