different tasks on examples with is also able to, in principle, learn the l. (2018) without the need for explicit h symbols are the outputs to be pre- rvised objective is the the same as the e but only evaluated on a subset of the minimum of the unsupervised objective imum of the supervised objective. In g, the concerns with density estimation ng objective discussed in (Sutskever tepped. The problem instead becomes to, in practice, optimize the unsuper- nvergence. Preliminary experiments ently large language models are able to ”I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]. In a now-deleted post from Aug. 16, Soheil Eid, Tory candidate in the riding of Joliette, wrote in French: ”Mentez mentez, il en restera toujours quelque chose,” which translates as, ”Lie lie and something will always remain.” “I hate the word ‘perfume,”’ Burr says. ‘It’s somewhat better in French: ‘parfum.’ If listened carefully at 29:55, a conversation can be heard between two guys in French: “-Comment on fait pour aller de l’autre cot´ e? -Quel autre cot´ e?”, which means “- How do you get to the other side? - What side?”. If this sounds like a bit of a stretch, consider this ques- tion in French: As-tu aller au cin´ ema?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater? 例: 英語 ⇒ フランス語 以下のようなテキストがWebには たくさんある → ここから ⾔語モデルを学習 ※ GPT-2の論⽂から引⽤
four methods for performing a task with a language mo and few-shot, which we study in this work, require th time. We typically present the model with a few dozen descriptions, examples and prompts can be found in Ap GPT-2 GPT-3 [task description], ([example],) [prompt]を⾔語モデルで読み、 その後を出⼒していく (パラメータを更新しない)
word, and write that word: r e!c.i p r o.c a/l = Completion ! reciprocal Figure G.26: Evaluation example for Symbol Insertion Context ! Please unscramble the letters into a word, and write that word: taefed = Completion ! defeat Figure G.27: Evaluation example for Reversed Words 56 Symbol Insertion タスク
Size Learning Rate GPT-3 Small 125M 12 768 12 64 0.5M 6.0 ⇥ 10 4 GPT-3 Medium 350M 24 1024 16 64 0.5M 3.0 ⇥ 10 4 GPT-3 Large 760M 24 1536 16 96 0.5M 2.5 ⇥ 10 4 GPT-3 XL 1.3B 24 2048 24 128 1M 2.0 ⇥ 10 4 GPT-3 2.7B 2.7B 32 2560 32 80 1M 1.6 ⇥ 10 4 GPT-3 6.7B 6.7B 32 4096 32 128 2M 1.2 ⇥ 10 4 GPT-3 13B 13.0B 40 5140 40 128 2M 1.0 ⇥ 10 4 GPT-3 175B or “GPT-3” 175.0B 96 12288 96 128 3.2M 0.6 ⇥ 10 4 Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models which we trained. All models were trained for a total of 300 billion tokens. 2.1 Model and Architectures We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a 1GPUだと数⼗〜数百年 (そもそもモデルがGPU メモリに乗らないので 不可能) GPT-2 ≒
the analysis in Scaling Laws For Neural Languag 20] we train much larger models on many fewer tokens than is typical. As a consequence, although G 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of e-training. Methodology for these calculations can be found in Appendix D. Dataset Quantity (tokens) Weight in training mix Epochs elapsed when training for 300B tokens Common Crawl (filtered) 410 billion 60% 0.44 WebText2 19 billion 22% 2.9 Books1 12 billion 8% 1.9 Books2 55 billion 8% 0.43 Wikipedia 3 billion 3% 3.4 2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples durin rawn from a given dataset, which we intentionally do not make proportional to the size of the data hen we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while othe ess than once. 良質なテキストの重みを⼤きく
RAG (Fine-tuned, Open-Domain) [LPP+20] 44.5 45.5 68.0 T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20] 36.6 44.7 60.5 T5-11B (Fine-tuned, Closed-Book) 34.5 37.4 50.1 GPT-3 Zero-Shot 14.6 14.4 64.3 GPT-3 One-Shot 23.0 25.3 68.0 GPT-3 Few-Shot 29.9 41.5 71.2 Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server. One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact Completion ! henri robert marcel duchamp Completion ! Duchampian Completion ! Duchamp Completion ! duchampian Completion ! marcel du champ Completion ! Marcel Duchamp Completion ! MARCEL DUCHAMP Figure G.34: Evaluation example for TriviaQA. TriviaQA allows for multiple valid completio Context ! Q: What school did burne hogarth establish? A: Completion ! School of Visual Arts Figure G.35: Evaluation example for WebQA Context ! Keinesfalls d¨ urfen diese f¨ ur den kommerziellen Gebrauch verwendet werden Completion ! In no case may they be used for commercial purposes. Figure G.36: Evaluation example for De!En. This is the format for one- and few-shot learning, other langauge tasks, the format for zero-shot learning is “Q: What is the {language} translation of { {translation}.” 58 関連する⽂書を検索
RAG (Fine-tuned, Open-Domain) [LPP+20] 44.5 45.5 68.0 T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20] 36.6 44.7 60.5 T5-11B (Fine-tuned, Closed-Book) 34.5 37.4 50.1 GPT-3 Zero-Shot 14.6 14.4 64.3 GPT-3 One-Shot 23.0 25.3 68.0 GPT-3 Few-Shot 29.9 41.5 71.2 Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server. One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact Completion ! henri robert marcel duchamp Completion ! Duchampian Completion ! Duchamp Completion ! duchampian Completion ! marcel du champ Completion ! Marcel Duchamp Completion ! MARCEL DUCHAMP Figure G.34: Evaluation example for TriviaQA. TriviaQA allows for multiple valid completio Context ! Q: What school did burne hogarth establish? A: Completion ! School of Visual Arts Figure G.35: Evaluation example for WebQA Context ! Keinesfalls d¨ urfen diese f¨ ur den kommerziellen Gebrauch verwendet werden Completion ! In no case may they be used for commercial purposes. Figure G.36: Evaluation example for De!En. This is the format for one- and few-shot learning, other langauge tasks, the format for zero-shot learning is “Q: What is the {language} translation of { {translation}.” 58 Figure 3.3: On TriviaQA GPT3’s performance grows smoothly with model size, suggesting that language models 関連する⽂書を検索
Average Accuracy Accuracy F1 Accuracy Accuracy Fine-tuned SOTA 89.0 91.0 96.9 93.9 94.8 92.5 Fine-tuned BERT-Large 69.0 77.4 83.6 75.7 70.6 71.7 GPT-3 Few-Shot 71.8 76.4 75.6 52.0 92.0 69.0 WiC WSC MultiRC MultiRC ReCoRD ReCoRD Accuracy Accuracy Accuracy F1a Accuracy F1 Fine-tuned SOTA 76.1 93.8 62.3 88.2 92.5 93.3 Fine-tuned BERT-Large 69.6 64.6 24.1 70.0 71.3 72.0 GPT-3 Few-Shot 49.4 80.1 30.5 75.4 90.2 91.1 Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates. 3.6 Reading Comprehension Context ! An outfitter provided everything needed for the safari. Before his first walking holiday, he went to a specialist outfitter to buy some boots. question: Is the word ‘outfitter’ used in the same way in the two sentences above? answer: Completion ! no the sealant. Figure G.4: Evaluation example for PIQA Context ! My body cast a shadow over the grass because Correct Answer ! the sun was rising. Incorrect Answer ! the grass was cut. Figure G.5: Evaluation example for COPA Context ! (CNN) Yuval Rabin, whose father, Yitzhak Rabin, w serving as Prime Minister of Israel, criticized D to "Second Amendment people" in a speech and warn politicians use can incite violence and undermine words are an incitement to the type of political me personally," Rabin wrote in USAToday. He said "Second Amendment people" to stop Hillary Clinton criticized as a call for violence against Clinton -- "were a new level of ugliness in an ugly campa - The son of a former Israeli Prime Minister who op ed about the consequence of violent political - Warns of "parallels" between Israel of the 1990 Correct Answer ! - Referencing his father, who was shot and killed political tension in Israel in 1995, Rabin condem aggressive rhetoric. Correct Answer ! - Referencing his father, who was shot and killed political tension in Israel in 1995, Rabin condem rhetoric. Incorrect Answer ! - Referencing his father, who was shot and killed political tension in Israel in 1995, Rabin condem aggressive rhetoric. Incorrect Answer ! - Referencing his father, who was shot and killed political tension in Israel in 1995, Rabin condem rhetoric. Incorrect Answer ! - Referencing his father, who was shot and killed political tension in Israel in 1995, Rabin condem aggressive rhetoric. 若⼲無理やり適⽤しているので できなくても仕⽅ない
with model size and number of examples in context. A value of K = 32 means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in
What is 98 plus 45? A: Completion ! 143 Figure G.44: Evaluation example for Arithmetic 2D+ Context ! Q: What is 95 times 45? A: Completion ! 4275 Figure G.45: Evaluation example for Arithmetic 2Dx Context ! Q: What is 509 minus 488? A: Completion ! 21 Mean accuracy 95% Confidence Interval (low, hi) t compared control (p-v Control (deliberately bad model) 86% 83%–90% - GPT-3 Small 76% 72%–80% 3.9 (2e-4 GPT-3 Medium 61% 58%–65% 10.3 (7e-2 GPT-3 Large 68% 64%–72% 7.3 (3e-1 GPT-3 XL 62% 59%–65% 10.7 (1e-1 GPT-3 2.7B 62% 58%–65% 10.4 (5e-1 GPT-3 6.7B 60% 56%–63% 11.2 (3e-2 GPT-3 13B 55% 52%–58% 15.3 (1e-3 GPT-3 175B 52% 49%–54% 16.9 (1e-3 Table 3.11: Human accuracy in identifying whether short (⇠200 word) news articles find that human accuracy (measured by the ratio of correct assignments to non-neutral assi Title: United Methodists Agree to Historic Split Subtitle: Those who oppose gay marriage will form their own denomination Article: Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human written article (accuracy: 12%). Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm Subtitle: Joaquin Phoenix pledged to not change for each awards event Article: Arithmetic News Article Generation ⼈間が書いたかGPT-3が書いたか⾒分けがつかない モデルサイズを⼤きくすると 急に解けるようになる
I eated the purple berries. Good English output: I ate the purple berries. Poor English input: Thank you for picking me as your designer. I’d appreciate it. Good English output: Thank you for choosing me as your designer. I appreciate it. Poor English input: The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications. Good English output: The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications. Poor English input: I’d be more than happy to work with you in another project. Poor English input: Please provide me with a short brief of the design you’re looking for and that’d be nice if you could share some examples or project you did before. Correcting English Grammar