Ross, J., & Goel, V. (2017, July). Self-critical sequence training for image captioning. CVPR 2017.
[Li+,2017] Li, J., Monroe, W., & Jurafsky, D. (2017). Learning to Decode for Future Success. arXiv [cs.CL].
[Khandelwal+,2021] Khandelwal, A. (2021). WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue. INLG 2021.

P.52
[Ziegler+,2019] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv [cs.CL].

P.53
[Choshen+,2020] Choshen, L., Fox, L., Aizenbud, Z., & Abend, O. (2020). On the weaknesses of reinforcement learning for neural machine translation. ICLR 2020.

P.54
[Stiennon+,2020] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to summarize from human feedback. NeurIPS 2020.

P.57
[Xie+,2018] Xie, Y., et al. (2018). A fast proximal point method for computing exact Wasserstein distance. arXiv preprint arXiv:1802.04307.