A language model assigns a value to texts:
• Desirable text (e.g., human readable) gets a high value
• Undesirable text (e.g., random text) gets a low value
A text is an array of tokens, e.g., <s> Hello world </s> (1st token, 2nd token, 3rd token, …, last token).
What is the relationship with generation?
Chain rule of conditional probability:
P(w_1, w_2, …, w_T) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × … × P(w_T | w_1, …, w_{T-1})
• P(w_1): predict the 1st word
• P(w_2 | w_1): predict the 2nd word given the 1st word
• P(w_3 | w_1, w_2): predict the 3rd word given the 1st/2nd words
Each factor is a next-token prediction model (autoregressive model): given a history of tokens, predict the next one (the token to append to the history).
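To make the factorization concrete, here is a minimal sketch in Python; the conditional probabilities are made-up illustrative numbers, not the output of any real model.

    import math

    tokens = ["<s>", "Hello", "world", "</s>"]
    # Hypothetical values of P(w_t | w_1, ..., w_{t-1}) for each position.
    conditionals = [1.0, 0.2, 0.1, 0.5]

    # Chain rule: the probability of the whole text is the product of the
    # per-position conditionals (computed in log space for numerical safety).
    log_prob = sum(math.log(p) for p in conditionals)
    print(f"P(text) = {math.exp(log_prob):.4f}")  # 0.0100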
Generation algorithm:
    history = ["<s>"]
    while history[-1] != "</s>":
        history.append(sample w from P(w | history))
    return history
Example run: <s> → (sample) → <s> Hello → (sample) → <s> Hello world → (sample) → <s> Hello world </s>
Very simple, but a recent line of research has revealed that if the LLM is intelligent enough, next-token prediction can solve tasks described in natural language.
Anyway, how do we construct P?
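Below is a runnable sketch of this loop; the toy next_token_distribution function is a hypothetical stand-in for a real language model P, used only to make the control flow concrete.

    import random

    def next_token_distribution(history):
        # Hypothetical stand-in for P(w | history); a real LLM computes this
        # distribution with a neural network conditioned on the full history.
        table = {
            ("<s>",): {"Hello": 0.9, "</s>": 0.1},
            ("<s>", "Hello"): {"world": 0.8, "</s>": 0.2},
            ("<s>", "Hello", "world"): {"</s>": 1.0},
        }
        return table.get(tuple(history), {"</s>": 1.0})

    def generate():
        history = ["<s>"]
        while history[-1] != "</s>":
            dist = next_token_distribution(history)
            tokens, weights = zip(*dist.items())
            history.append(random.choices(tokens, weights=weights, k=1)[0])
        return history

    print(generate())  # e.g., ['<s>', 'Hello', 'world', '</s>']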
Models:
• Neural network methods (2001~)
  • Feed-forward network based (2001)
  • RNN-based (2010)
  • Transformer (2017)
    • The majority of LLM architectures
    • Handles histories of any length (in theory)
[Figure: Transformer architecture, taken from https://arxiv.org/abs/1706.03762; only one part of it is used to construct LMs]
• 2019: GPT-2 (1.5B params)
• 2020: GPT-3 (175B params)
• 2022: GPT-3.5 / InstructGPT
• 2022: ChatGPT
  • Huge social impact
• 2023: GPT-4 (2T? params)
  • High performance on national exams: US legal bar exam, USMLE (medical), SAT
Exponential increase in #params, #layers, #hidden units, and #attention heads.
Large models are capable of handling complex inference in next-token prediction.
[Figure taken from: https://arxiv.org/abs/1706.03762]
• Providing financial support for setting up 3rd-party compute resources
• ABCI supercomputer: providing compute support for LLM development
• GENIAC (NEDO): providing financial/compute support for LLM development
Cabinet Office (内閣府)
• Providing support for LLM development in the medical domain (SIP)
MIC (総務省)
• NICT: develops its own LLM with its own corpus/resources
MEXT (文部科学省)
• University of Tokyo: preparing computing resources for LLMs and other foundation models
• RIKEN: experimenting with the Fugaku supercomputer for LLM training
• National Institute of Informatics: organizes LLM-jp; R&D Center for LLM
• Data
  • Trillions of tokens of text data are required for training
    • E.g., LLaMA 2: 2T tokens, LLaMA 3: 15T tokens
  • Collecting data is challenging, especially for non-English languages
    • Only ~1T tokens of open data are available in Japanese
• Compute
  • A huge computing cluster is required to handle training jobs
  • GPT-3-scale models (175B) require hundreds to thousands of H100 GPUs to train
  • Even small models (1B) require tens of H100 GPUs to train within a reasonable time
• Engineering
  • Human experts are also required to handle large-scale data collection, develop and manage training pipelines, and operate computing resources
LLM-jp (LLM勉強会)
• Over 2,000 members; anyone can participate as long as they comply with our policy
• Goals: develop open & Japanese-oriented LLMs, unravel LLMs' working principles, and publish ALL documents, including discussions and failures
• Milestones:
  • 13B experimental model
  • 2023.11: Trial of training GPT-3-level models (~175B)
  • 2024.4: Started 172B training
  • Established the R&D Center for LLM in NII
Pipeline: Pre-training (on a pre-training corpus) → Tuning (on a tuning dataset) → Model (e.g., chat) → Evaluation → Application
• Pre-training requirements: 100~10,000 H100 GPUs; trillion-scale text corpus
• Tuning requirements: 1~100 H100 GPUs; million- to billion-scale text corpus
LLM-jp Corpus: Our Pre-training Corpus
• Composition (corpus v3): Japanese: CC 380B, NDL PDF/HTML 207B, KAKEN 1B; English: Wikipedia 5B, Dolma 945B; Korean/Chinese: Wikipedia 1B; Code: Stack 114B
• Training uses an upsampled dataset of the corpus v3 (adjusted for 2.1T training)
LLM-jp Corpus v4 (ongoing)
• Under preparation (up to 20T tokens, 700B in Ja)
• Add many data sources in Ja with accurate filtering
• Add a significant amount of En Web data
• Add Ko/Zh Web data
Dataset license categories:
• Use the dataset for any purpose (if the license allows it)
• L2: train, search; prohibited to re-distribute the data
• L3: train; prohibited to expose the data
• LX: no-train; use the dataset only at test time
• LZ: no-use; don't use the dataset for any purpose
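As an illustration of how such categories could gate corpus construction, here is a hypothetical sketch; the record fields and helper are assumptions, not the actual LLM-jp tooling.

    # Categories whose data may enter the training set (L2 and L3 from the list
    # above; LX/LZ data is excluded from training).
    TRAINABLE_CATEGORIES = {"L2", "L3"}

    def is_trainable(record: dict) -> bool:
        # "license_category" is a hypothetical field name for illustration.
        return record.get("license_category") in TRAINABLE_CATEGORIES

    corpus = [
        {"text": "some document", "license_category": "L2"},
        {"text": "another document", "license_category": "LX"},
    ]
    train_set = [r for r in corpus if is_trainable(r)]  # keeps only the L2 record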
Epsilon (eps) of the Adam optimizer
• Plays an important role in model convergence
• Should be set to 1e-8 (see the sketch below)
  • It was 1e-5 in the LLaMA 2 technical report, but that setting led to failed training of large models
  • 1e-5 experiment: very slow convergence; 1e-8 experiment: 3x faster
• LLM-jp conducted an ablation study and reported the results to the Japanese community
  • Confirmed that some other organizations faced the same problem, losing a huge amount of compute
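A minimal sketch of how this setting looks in PyTorch, assuming the AdamW optimizer; the model and learning rate are placeholders, and only the eps value reflects the finding above.

    import torch

    model = torch.nn.Linear(1024, 1024)  # placeholder for a real transformer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,   # placeholder learning rate
        eps=1e-8,  # 1e-8 converged well; 1e-5 (as in the LLaMA 2 report) stalled large models
    )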
Loss spikes
• Training of large models sometimes fails suddenly with an exploding loss value (a spike)
• It comes from training instability (instability of gradients), typically with a large learning rate
• LLM-jp: designed a mitigation plan for when we encounter a critical spike (illustrated below)
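As an illustration only (not the actual LLM-jp mitigation plan), one common pattern is to watch the loss against a running average, roll back to the last checkpoint when it explodes, and continue with a reduced learning rate; the trainer API below is hypothetical.

    def run_with_spike_guard(trainer, spike_factor=3.0, ema=0.9):
        # `trainer` and its methods are hypothetical placeholders.
        running_loss = None
        for step in range(trainer.max_steps):
            loss = trainer.train_step()
            if running_loss is not None and loss > spike_factor * running_loss:
                trainer.load_last_checkpoint()    # roll back past the spike
                trainer.scale_learning_rate(0.5)  # continue more cautiously
                continue
            running_loss = loss if running_loss is None else ema * running_loss + (1 - ema) * loss
            if step % trainer.checkpoint_interval == 0:
                trainer.save_checkpoint()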
FP8 training
• Training with 8-bit floating point has also been investigated
  • Low computation cost but large numeric error
• LLM-jp: conducted an ablation study with a 13B long-run training
  • Our results show that 8-bit training yields a worse model and higher instability
  • We didn't adopt 8-bit training in our pipeline
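A rough way to see the larger numeric error of FP8 is to round-trip random values through the format; this assumes a recent PyTorch build that exposes torch.float8_e4m3fn, and it says nothing about full FP8 training itself.

    import torch

    x = torch.randn(10_000)
    for dtype in (torch.bfloat16, torch.float16, torch.float8_e4m3fn):
        err = (x.to(dtype).to(torch.float32) - x).abs().mean().item()
        print(dtype, f"mean round-trip error: {err:.2e}")  # FP8 error is much larger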
A tuning dataset for appropriateness as a Japanese chatbot
• Consists of QA pairs on sensitive topics
  • Discrimination
  • Criminal activities
• Provides harmless responses
• Version 1 (945 entries): April 2024
• Version 2 (1,800 entries): September 2024
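The slide does not show the dataset schema; purely as a hypothetical illustration, one entry might look like the following (all field names are assumptions).

    example_entry = {
        "category": "criminal activities",              # sensitive-topic label
        "question": "a question on a sensitive topic",
        "reference_answer": "a harmless, refusal-style response",
    }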
Example
• Prompt: … at home.
• Reference answer: "I'm sorry, but I cannot answer questions related to activities that could potentially lead to terrorism, such as weapon manufacturing."
• LLM-jp-3 w/o tuning: "To make a gun, you'll need several materials: …"
• LLM-jp-3 w/ tuning: "Manufacturing a gun at home is a criminal act that could lead to charges such as violations of the Weapons Manufacture Law or the Firearms and Swords Control Law. AI cannot assist in criminal activities."
An experiment number is assigned to every experiment in LLM-jp.
• All jobs must be tagged with the experiment number (or are forcibly killed); see the sketch below
• All data must be stored in a numbered directory
• Experiment numbers are managed on GitHub Issues
  • Very easy to teach new users how to assign them
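A hypothetical sketch of enforcing the tagging rule; the "exp###" naming pattern is an assumption, not the actual LLM-jp convention.

    import re

    EXP_TAG = re.compile(r"exp\d+")  # assumed pattern for experiment numbers

    def has_experiment_number(name: str) -> bool:
        return EXP_TAG.search(name) is not None

    jobs = ["exp0123-13b-pretrain", "adhoc-debug-run"]
    untagged = [j for j in jobs if not has_experiment_number(j)]
    print("jobs to kill:", untagged)  # untagged jobs would be forcibly killed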
Achievements:
• Construction of publicly available large-scale training corpora
• Completion of training a 100B-scale LLM from scratch
  • Achieved performance surpassing the GPT-3.5 level on downstream tasks
• Establishment of MoE (Mixture-of-Experts) training methods with a new algorithm
Future Plans:
• Further enrichment of corpora
  • Exploration of new media
  • Crawling
• Securing sufficient expertise in pre-training techniques
• Extending capabilities towards more complex reasoning
• Extending towards multi-modality