• UDA is competitive with fully supervised SOTA using only a few labels (just 20 samples)
  – Scores on the five-category (multi-class) tasks still have a gap in comparison to the SOTA scores
• An unexpected finding here is that BERTFINETUNE alone improves the results considerably, even with a small labeled set
  – Compare the computational cost of LM fine-tuning vs. data augmentation
(Xie+, '19)

Experimental results

Main results. The results for text classification are shown in Table 1, with three key observations.

• Firstly, UDA consistently improves performance regardless of the model initialization scheme. Most notably, even when BERT is further fine-tuned on in-domain data, UDA still significantly reduces the error rate from 6.50% to 4.20% on IMDb. This shows that the benefits UDA provides are complementary to those of representation learning.

• Secondly, with a significantly smaller amount of supervised examples, UDA offers decent or even competitive performance compared to the SOTA models trained on the full supervised data. In particular, on the binary sentiment classification tasks, with only 20 supervised examples, UDA outperforms the previous SOTA trained on full supervised data on IMDb and comes very close on Yelp-2 and Amazon-2.

• Finally, the five-category sentiment classification tasks turn out to be much more difficult than their binary counterparts, and there is still a clear gap between UDA with 500 labeled examples per class and BERT trained on the entire supervised set. This suggests room for further improvement.

Fully supervised baseline

| Datasets (# Sup examples) | IMDb (25k) | Yelp-2 (560k) | Yelp-5 (650k) | Amazon-2 (3.6m) | Amazon-5 (3m) | DBpedia (560k) |
|---------------------------|------------|---------------|---------------|-----------------|---------------|----------------|
| Pre-BERT SOTA             | 4.32       | 2.16          | 29.98         | 3.32            | 34.81         | 0.70           |
| BERTLARGE                 | 4.51       | 1.89          | 29.32         | 2.63            | 34.17         | 0.64           |

Semi-supervised setting

| Initialization | UDA | IMDb (20) | Yelp-2 (20) | Yelp-5 (2.5k) | Amazon-2 (20) | Amazon-5 (2.5k) | DBpedia (140) |
|----------------|-----|-----------|-------------|---------------|---------------|-----------------|---------------|
| Random         | ✗   | 43.27     | 40.25       | 50.80         | 45.39         | 55.70           | 41.14         |
| Random         | ✓   | 25.23     | 8.33        | 41.35         | 16.16         | 44.19           | 7.24          |
| BERTBASE       | ✗   | 27.56     | 13.60       | 41.00         | 26.75         | 44.09           | 2.58          |
| BERTBASE       | ✓   | 5.45      | 2.61        | 33.80         | 3.96          | 38.40           | 1.33          |
| BERTLARGE      | ✗   | 11.72     | 10.55       | 38.90         | 15.54         | 42.30           | 1.68          |
| BERTLARGE      | ✓   | 4.78      | 2.50        | 33.54         | 3.93          | 37.80           | 1.09          |
| BERTFINETUNE   | ✗   | 6.50      | 2.94        | 32.39         | 12.17         | 37.32           | -             |
| BERTFINETUNE   | ✓   | 4.20      | 2.05        | 32.08         | 3.50          | 37.12           | -             |

Table 1: Error rates on text classification datasets. In the fully supervised setting, the pre-BERT SOTAs include ULMFiT [26] for Yelp-2 and Yelp-5, DPCNN [29] for Amazon-2 and Amazon-5, and Mixed VAT [51] for IMDb and DBpedia.

Results with different labeled set sizes. We also evaluate the performance of UDA with different numbers of supervised examples. As shown in Figure 4, UDA leads to consistent improvements for all labeled set sizes. In the large-data regime, with the full training set of IMDb, UDA also provides robust gains. On Yelp-2, with 2,000 examples, UDA outperforms the previous SOTA model trained with 560,000 examples.
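
For context on what the UDA column in Table 1 means, the semi-supervised rows train with a combined objective: standard cross-entropy on the small labeled set plus a consistency loss that matches predictions on unlabeled examples and their augmented (e.g. back-translated) versions. Below is a minimal, hedged PyTorch sketch of that objective, not the paper's actual implementation; the function and argument names are hypothetical, and details such as confidence-based masking and training signal annealing are omitted.

```python
# Minimal sketch of a UDA-style training objective, assuming a generic PyTorch
# classifier `model` and pre-computed augmented (e.g. back-translated) unlabeled
# inputs. Names are illustrative, not taken from the paper's code release.
import torch
import torch.nn.functional as F

def uda_loss(model, labeled_x, labeled_y, unlabeled_x, unlabeled_aug_x, lam=1.0):
    # Supervised part: cross-entropy on the (small) labeled batch.
    sup_logits = model(labeled_x)
    sup_loss = F.cross_entropy(sup_logits, labeled_y)

    # Unsupervised consistency part: the prediction on the original unlabeled
    # example is treated as a fixed target (no gradient), and the prediction on
    # the augmented version is pushed toward it via KL divergence.
    with torch.no_grad():
        target_probs = F.softmax(model(unlabeled_x), dim=-1)
    aug_log_probs = F.log_softmax(model(unlabeled_aug_x), dim=-1)
    consistency_loss = F.kl_div(aug_log_probs, target_probs, reduction="batchmean")

    # Total loss: supervised term plus weighted consistency term.
    return sup_loss + lam * consistency_loss
```

The rows marked ✗ correspond to dropping the consistency term and training on the labeled batch alone, which is what makes the gap between the ✗ and ✓ rows in Table 1 directly attributable to the unsupervised objective.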