Related Literature
• Aroyo, Lora, and Chris Welty. "Truth is a lie: Crowd truth and the seven myths of human annotation." AI Magazine 36.1 (2015): 15-24.
• Powers, David M. W. "What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes." arXiv preprint arXiv:1503.06410 (2015).
• Raji, Inioluwa Deborah, et al. "AI and the everything in the whole wide world benchmark." arXiv preprint arXiv:2111.15366 (2021).
• Eriksson, Maria, et al. "Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation." arXiv preprint arXiv:2502.06559 (2025).
• Chandrasekaran, Jaganmohan, et al. "Test & evaluation best practices for machine learning-enabled systems." arXiv preprint arXiv:2310.06800 (2023).
• Liao, Thomas, et al. "Are we learning yet? A meta review of evaluation failures across machine learning." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.