• Stability-AI. (2023), "JP Language Model Evaluation Harness", https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/README.md
• Weights & Biases Japan. (2023), "Nejumi LLM Leaderboard", https://wandb.ai/wandb/LLM_evaluation_Japan/reports/Nejumi-LLM---Vmlldzo0NTUzMDE2
• Chang et al. (2023), "A Survey on Evaluation of Large Language Models", arXiv:2307.03109.
• OpenAI. (2023), "GPT-4 Technical Report", arXiv:2303.08774.
• Google. (2023), "PaLM 2 Technical Report", arXiv:2305.10403.
• BIG-bench authors. (2022), "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models", arXiv:2206.04615.
• YuzuAI. (2023), "The Rakuda Ranking of Japanese AI", https://yuzuai.jp/benchmark
• Labrak et al. (2023), "A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks", arXiv:2307.12114.
• Zhuo et al. (2023), "On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex", arXiv:2301.12868.