BERT
• The BERT model was significantly undertrained
• Improvements: longer training, bigger batches, removing NSP, longer sequences, and dynamically changing the masking (see the sketch below)
• They conduct experiments to compare design decisions: objective, sub-word tokenization, batch size, and training duration
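A minimal sketch of what "dynamically changing the masking" means: the masked positions are re-sampled every time a sequence is seen, rather than fixed once during preprocessing. The function below is hypothetical (not the authors' code); the 15% masking rate and 80/10/10 replacement rule are taken from the original BERT recipe.

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Re-sample masked positions on every call (dynamic masking),
    instead of fixing them once in preprocessing (static masking)."""
    labels = [-100] * len(token_ids)   # -100 = position ignored by the loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok            # predict the original token here
            r = random.random()
            if r < 0.8:                # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:              # 10%: replace with a random token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```

Because the function is called anew in each epoch, the model sees different masked variants of the same sentence over training, rather than a single fixed pattern.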