evaluation. We report the validation perplexity across two settings: 8 experts and 32 experts.

Model                      Perplexity
                           8 Experts  32 Experts
English-focused language modeling
Dense (without Experts)    16.23      16.23
X-MoE                      14.82      11.96
MH-MoE (Ours)              12.72      10.28
Multi-lingual language modeling
Dense (without Experts)     8.56       8.56
X-MoE                       7.19       6.02
MH-MoE (Ours)               6.26       5.09
Masked multi-modal modeling
Dense (without Experts)    17.95      17.95
X-MoE                      16.34      12.68
MH-MoE (Ours)              14.73      10.87

benefits from enhanced representation learning capabilities as more experts are incorporated. These results collectively demonstrate the superiority of MH-MoE in terms of learning efficiency and language representation across multiple pre-training paradigms.

4.3. Downstream Evaluation

For each pre-training task, we conduct the corresponding downstream evaluation to validate the efficacy of MH-MoE.

Table 2. Accuracy / accuracy-normalization scores for language understanding tasks using the LLM Evaluation Harness (Gao et al., 2023).

Model    ARC-Challenge  ARC-Easy   RTE   BookQA     Winogrande  PiQA  BoolQ  HellaSwag  TruthfulQA (mc1/mc2)  Avg
Dense    18.1/23.3      44.9/39.7  51.5  17.1/29.0  48.2        66.6  55.0   29.7/34.1  24.1/39.3             37.2
Experts Number N = 8
X-MoE    19.0/24.7      48.3/42.0  52.7  17.4/29.8  50.3        67.9  58.4   31.4/35.7  24.3/40.2             38.7
MH-MoE   19.6/25.2      50.2/42.2  53.0  18.2/30.3  51.1        68.7  59.6   33.2/40.3  24.7/40.9             39.8
Experts Number N = 32
X-MoE    19.4/24.8      50.4/42.5  52.7  17.8/30.0  51.3        68.8  52.8   33.4/40.1  24.3/39.1             39.1
MH-MoE   21.4/26.8      50.6/44.8  53.4  18.8/31.6  53.8        69.3  56.6   35.0/42.1  24.8/39.5             40.6

Table 3. Accuracy / accuracy-normalization scores on multilingual understand-
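The paired numbers in Table 2 presumably correspond to the accuracy (acc) and length-normalized accuracy (acc_norm) metrics produced by the LM Evaluation Harness. As a minimal sketch (not the authors' released evaluation script), such a run might look as follows with the harness's Python API; the checkpoint path is a placeholder, BookQA is assumed to refer to OpenBookQA, and exact task names and the model-loader key may differ between harness versions.

```python
# Hypothetical sketch: scoring a checkpoint on the Table 2 tasks with
# EleutherAI's lm-evaluation-harness (v0.4-style API). The checkpoint path
# is a placeholder, and task names may vary slightly across harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # Hugging Face causal-LM loader
    model_args="pretrained=path/to/mh-moe-checkpoint",
    tasks=[
        "arc_challenge", "arc_easy", "rte", "openbookqa",
        "winogrande", "piqa", "boolq", "hellaswag",
        "truthfulqa_mc1", "truthfulqa_mc2",
    ],
    num_fewshot=0,
    batch_size=8,
)

# Each task's metrics include accuracy and, where defined, length-normalized
# accuracy, i.e. the paired acc / acc_norm numbers reported in Table 2.
for task, metrics in results["results"].items():
    print(task, metrics)
```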