AI Latest Papers Reading Group, June 2022

AI Study Group 2022/6, 松山亮佑

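Self-introduction: 松山亮佑 (Ryo)
• I work in corporate IT and internal controls at a cloud chat service company
• Father of one 🎉
• On AI
◦ I want to do something for my family's factory
▪ Automating inspection, assembly, and picking
◦ I want to strengthen organizational security
▪ Threat detection and trust scores
◦ I want to reduce the effort of audits
▪ Interpretability and plausibility (including audits of the AI itself): what kind of audit AI would an audit firm sign off on?
◦ → In any case, processing large volumes of information is impossible without AI!
▪ Object detection, anomaly detection, reinforcement learning?, interpretability, natural language processing, etc.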

DeepL https://www.deepl.com/ja/translator Many thanks for the help with translation.

Evaluation terminology
Data Science Performance Metrics for Everyone: https://towardsdatascience.com/data-science-performance-metrics-for-everyone-4d68f4859eef
Japanese explanation: https://qiita.com/K5K/items/5da52e99861483cae876
(Confusion-matrix labels: ground truth is True / ground truth is False)
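As a quick refresher on the terms above, the sketch below computes accuracy, precision, recall, and F1 from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are illustrative assumptions, not taken from the linked articles.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common metrics from confusion-matrix counts.

    tp/fn: ground truth is True, predicted True/False.
    fp/tn: ground truth is False, predicted True/False.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced binary problem.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced data like this, accuracy is high while recall is noticeably lower, which is why the slides separate the terms.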

1. OPT: Open Pre-trained Transformer Language Models ← PickUp !!
2. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
3. Unifying Language Learning Paradigms
4. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
5. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning
6. CoCa: Contrastive Captioners are Image-Text Foundation Models
7. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities
8. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
9. Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems
10. Building Machine Translation Systems for the Next Thousand Languages

1. OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2205.01068v3
Meta AI
Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
GitHub - facebookresearch/metaseq: Repo for external large-scale work

Codebase and pretrained models https://huggingface.co/facebook/opt-30b https://github.com/facebookresearch/metaseq

Performance evaluation: OPT slightly underperforms GPT-3 overall, but the gap varies widely by task.

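Bias and toxicity evaluation
• CrowS-Pairs: measures intrasentence-level bias in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status
• StereoSet: measures stereotypical bias across 4 categories: profession, gender, religion, and race
• Dialogue safety: the first benchmark, SaferDialogues, measures the ability to recover from explicit safety failures, usually by apologizing or acknowledging the mistake. The second, Safety Bench Unit Tests, measures how unsafe a model's response is, stratified across four levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial.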

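Bias and toxicity evaluation: this evaluates OPT-175B's tendency to respond with toxic language. Overall, OPT-175B has a higher toxicity rate than PaLM or Davinci. All three models also become more likely to generate toxic continuations as the toxicity of the prompt increases. Including unmoderated social media text in the pre-training corpus made the model more familiar with harmful text, which likely raises its tendency both to generate and to detect such text (it becomes strongly aware of toxic language).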

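Disclosure: By giving researchers access to the logbook, the code, the OPT-175B model weights, and a suite of smaller baselines that mirror the OPT-175B setup, all details of the OPT-175B training are disclosed. Sharing a detailed account of the day-to-day training process reveals not only the compute used to train the current version of OPT-175B, but also the human overhead required when the underlying infrastructure or the training process itself becomes unstable at scale. https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf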

2. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
https://arxiv.org/abs/2204.12511v2
Waymo LLC, Google LLC
Abstract: Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets. Motivated by how functions can be approximated via Taylor expansion, we propose a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions. Our PolyLoss allows the importance of different polynomial bases to be easily adjusted depending on the targeting tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases. Extensive experimental results show that the optimal choice within the PolyLoss is indeed dependent on the task and dataset. Simply by introducing one extra hyperparameter and adding one line of code, our Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.
Table 1: PolyLoss outperforms cross-entropy and focal loss on various models and tasks. Results are for the simplest Poly-1, which has only a single hyperparameter. On ImageNet (Deng et al., 2009), our PolyLoss improves both pre-training and fine-tuning for the recent EfficientNetV2 (Tan & Le, 2021); on COCO (Lin et al., 2014), PolyLoss improves both 2D detection and segmentation AR for Mask-RCNN (He et al., 2017); on Waymo Open Dataset (WOD) (Sun et al., 2020), PolyLoss improves 3D detection AP for the widely used PointPillars (Lang et al., 2019) and the very recent Range Sparse Net (RSN) (Sun et al., 2021). Details are in Table 4, 5, 7.

Cross-entropy loss function and focal loss function
Figure 1. We propose a novel loss we term the Focal Loss that adds a factor $(1 - p_t)^{\gamma}$ to the standard cross entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > .5), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.
When a prediction is already close to the correct answer, further learning from it is suppressed, which makes it easier to keep learning from misclassified data. https://ai-lab-boatrace.xyz/blog/672/
However, although focal loss is effective for many detection tasks, the PolyLoss paper finds that it is not optimal for the imbalanced ImageNet-21K. Focal loss was thought to be robust on imbalanced datasets, but on closer inspection there are datasets where it is not.
Focal Loss for Dense Object Detection - https://arxiv.org/pdf/1708.02002.pdf
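The effect of the $(1 - p_t)^{\gamma}$ factor can be checked numerically. The following is a minimal per-example sketch in plain Python (no class weighting term, function names chosen here for illustration), not the reference implementation from the focal loss paper.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Cross-entropy for one example, given the target-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss for one example: cross-entropy scaled by (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


# Easy (p_t = 0.9) vs. hard (p_t = 0.1) example: focal loss shrinks the easy
# example's loss far more than the hard example's loss.
for p_t in (0.9, 0.1):
    print(p_t, cross_entropy(p_t), focal_loss(p_t))
```

With γ = 2, the easy example keeps 1% of its cross-entropy loss while the hard example keeps 81%, which is the "focus on hard examples" behaviour described in the caption.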

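Taylor expansion: Suppose we know everything about a function at $x_0$: its value, the slope of its graph, its second derivative, its third derivative, and so on. The question posed is whether this "abundance of information" is enough to determine the value f(x) at a point x slightly away from $x_0$. (https://eman-physics.net/math/taylor.html) When the Taylor series converges and agrees with the original function f, f is said to be Taylor-expandable. (https://ja.wikipedia.org/wiki/%E3%83%86%E3%82%A4%E3%83%A9%E3%83%BC%E5%B1%95%E9%96%8B)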

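Pt is the predicted probability of the target class label. Commonly used classification losses such as cross-entropy loss and focal loss are decomposed into a series of weighted polynomial bases, and the basis weights are adjusted to suit the application. The optimal loss function is found by tuning the coefficients αj on the powers of (1 − Pt), the distance from the target (ground truth). Since infinitely many coefficients cannot be tuned, a strategy is needed.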
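Written out, the decomposition is the Taylor expansion of the losses around $P_t = 1$. The block below is a sketch of the expansions as they appear in the PolyLoss paper, with $\gamma$ the focal loss exponent and $\alpha_j$ the free coefficients of the general framework.

```latex
\begin{align}
L_{\mathrm{CE}} &= -\log P_t
  = \sum_{j=1}^{\infty} \frac{1}{j}\,(1 - P_t)^{j} \\
L_{\mathrm{FL}} &= -(1 - P_t)^{\gamma} \log P_t
  = \sum_{j=1}^{\infty} \frac{1}{j}\,(1 - P_t)^{j+\gamma} \\
L_{\mathrm{Poly}} &= \sum_{j=1}^{\infty} \alpha_j\,(1 - P_t)^{j}
\end{align}
```

Cross-entropy is the special case $\alpha_j = 1/j$, and focal loss shifts every power by $\gamma$ without changing the coefficients.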

Viewing losses through the PolyLoss framework
Figure 1: Unified view of cross-entropy loss, focal loss, and PolyLoss. PolyLoss $\sum_{j=1}^{\infty} \alpha_j (1 - P_t)^j$ is a more general framework, where $P_t$ stands for prediction probability of the target class. Left: PolyLoss is more flexible: it can be steeper (deep red) than cross-entropy loss (black) or flatter (light red) than focal loss (green). Right: Polynomial coefficients of different loss functions in the bases of $(1 - P_t)^j$, where $j \in \mathbb{Z}^+$. Black dash lines are drawn to show the trend of polynomial coefficients. In the PolyLoss framework, focal loss can only shift the polynomial coefficients horizontally (green arrow), see Equation 2, whereas the proposed PolyLoss framework is more general, which also allows vertical adjustment (red arrows) of the polynomial coefficient for each polynomial term.
The space of possible loss functions widens this much!
• Can discover loss functions better than focal loss for class imbalance
• Losses robust to label noise
• Learning the loss function

On the optimal polynomial coefficients: only this much changes
Table 2: Comparing different losses in the PolyLoss framework. Dropping higher order polynomial, proposed in prior works, truncates all higher order (N + 1 → ∞) polynomial terms. We propose Poly-N loss, which perturbs the leading N polynomial coefficients. Poly-1 is the final loss formulation, which further simplifies Poly-N and only requires a simple grid search over one hyperparameter. The differences compared to cross-entropy loss are highlighted in red.

Poly-1 Loss
Figure 3: The first polynomial plays an important role for training ResNet-50 on ImageNet-1K. (a) Increasing the coefficient of the first polynomial term (ε1 > 0) consistently improves the ResNet50 prediction accuracy. Red dash line shows the accuracy when using cross-entropy loss. Mean and stdev of three runs are plotted. (b) The first polynomial (1 − Pt) contributes more than half of the cross-entropy gradient at the last 65% of the training steps, which highlights the importance of tuning the first polynomial. The red dash line shows the crossover.
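Poly-1 perturbs only the coefficient of the first polynomial term, so it amounts to adding ε1(1 − Pt) to the base loss. The sketch below shows this for single examples in plain Python; the function names are chosen here for illustration, and a real training loop would apply the same formula to batched framework tensors.

```python
import math


def poly1_cross_entropy(p_t: float, epsilon1: float = 1.0) -> float:
    """Poly-1 on top of cross-entropy: -log(p_t) + epsilon1 * (1 - p_t)."""
    return -math.log(p_t) + epsilon1 * (1.0 - p_t)


def poly1_focal(p_t: float, gamma: float = 2.0, epsilon1: float = 1.0) -> float:
    """Poly-1 on top of focal loss; the leading term is (1 - p_t)^(gamma + 1)."""
    focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
    return focal + epsilon1 * (1.0 - p_t) ** (gamma + 1.0)


# epsilon1 = 0 recovers plain cross-entropy; epsilon1 = -1 cancels the first
# polynomial term, which is the "dropping the first term" case from the slides.
print(poly1_cross_entropy(0.5, epsilon1=0.0))
print(poly1_cross_entropy(0.5, epsilon1=2.0))
print(poly1_cross_entropy(0.5, epsilon1=-1.0))
```

Because there is a single hyperparameter, choosing ε1 reduces to a one-dimensional grid search per task and dataset.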

Perturbation (摂動, せつどう): In problems in mathematics, physics, and astronomy, the main part can sometimes be solved exactly while the full problem, which adds a small extra term, cannot. When the full problem is solved approximately by treating it as the main part plus a small correction from the extra term, the extra term is called a perturbation and the method is called perturbation theory. In celestial mechanics, quantum mechanics, quantum field theory, and related fields, perturbation theory is often used as an effective approximation for solving equations of motion. https://kotobank.jp/word/%E6%91%82%E5%8B%95-87386

Poly-1 Loss - Experimental results (image classification)
Table 4: PolyLoss improves classification accuracy on ImageNet validation set. We set ε1 = 2 for both.
Figure 4: PolyLoss improves EfficientNetV2 family on the speed-accuracy Pareto curve. Validation accuracy of EfficientNetV2 models pretrained on ImageNet-21K are plotted. PolyLoss outperforms cross-entropy loss with about 2x speed-up.
Figure 5: PolyLoss improves EfficientNetV2-L by increasing prediction confidence Pt.

Poly-1 Loss - Experimental results (segmentation, object detection)
The loss is replaced with PolyLoss.
Table 5: PolyLoss improves detection results on COCO validation set. Bounding box and instance segmentation mask average-precision (AP) and average-recall (AR) are reported for Mask R-CNN model with a ResNet-50 backbone. Mean and stdev of three runs are reported.
Figure 6: PolyLoss improves Mask R-CNN by lowering overconfident predictions. Mean and stdev of three runs are plotted.
→ The optimal value of ε differs, so it is important to tune the loss function to the dataset and task.

Poly-1 Loss - Experimental results (3D object detection)
Figure 7: Visualizing $L^{\mathrm{FL}}_{\text{Poly-1}}$ and $L^{\mathrm{FL}}_{\text{Poly-1}^*}$ in the PolyLoss framework.
Table 6: PolyLoss vs. focal loss for 3D detection models. Differences are highlighted in red. We found the best Poly-1 for PointPillars is ε1 = −1, which is equivalent to dropping the first term. Therefore, for RSN, we drop the first term and tune the new leading polynomial $(1 - P_t)^{\gamma+2}$.
Table 7: PolyLoss improves detection results on Waymo Open Dataset validation set. Two detection models: single-stage PointPillars (Lang et al., 2019) and two-stage SOTA RSN (Sun et al., 2021) are evaluated. Bird's eye view (BEV) and 3D detection average precision (AP) and average precision with heading (APH) at Level 1 (L1) and Level 2 (L2) difficulties are reported. The IoU threshold is set to 0.7 for vehicle detection and 0.5 for pedestrian detection.
RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection
PointPillars: Fast Encoders for Object Detection from Point Clouds

3. Unifying Language Learning Paradigms
https://arxiv.org/abs/2205.05131v1
Google Research
Abstract: Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We release Flax-based T5X model checkpoints for the 20B model at https://github.com/google-research/google-research/tree/master/ul2.

span: a span is a linguistic unit consisting of one or more words (from スパン間の類似性に基づく事例ベース構造予測, https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/D1-1.pdf)

Denoiser
• A variant of masked language modeling.
• Used in T5.
◦ Original paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (https://arxiv.org/abs/1910.10683)
◦ Explainer video: 【深層学習】T5 - 入出力をテキストにする Transformer の新利用法【ディープラーニングの世界vol.37】 (https://youtu.be/-x08lNz3Qfo?t=934)
Figure 2: Schematic of the objective we use in our baseline model. In this example, we process the sentence “Thank you for inviting me to your party last week.” The words “for”, “inviting” and “last” (marked with an ×) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since “for” and “inviting” occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input plus a final sentinel token <Z>.
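The span-corruption objective in the caption can be reproduced on the example sentence with a few lines of Python. This is a minimal sketch on whitespace-split words with the corrupted positions chosen by hand; the function name `span_corrupt` and the fixed sentinel strings are assumptions for illustration, and the real T5 pipeline works on subword tokens with randomly sampled spans.

```python
def span_corrupt(tokens, corrupt_idx, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each run of consecutive corrupted tokens by one sentinel.

    Returns (inputs, targets). Raises StopIteration if there are more
    corrupted spans than available sentinels (minus the final one).
    """
    corrupt = set(corrupt_idx)
    sentinel_iter = iter(sentinels)
    inputs, targets = [], []
    prev_corrupted = False
    for i, tok in enumerate(tokens):
        if i in corrupt:
            if not prev_corrupted:
                # Start of a new corrupted span: emit a fresh sentinel on both sides.
                sentinel = next(sentinel_iter)
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(tok)
            prev_corrupted = True
        else:
            inputs.append(tok)
            prev_corrupted = False
    targets.append(next(sentinel_iter))  # final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, corrupt_idx=[2, 3, 8])
print(" ".join(inputs))
print(" ".join(targets))
```

The two printed lines correspond to the corrupted input and the target sequence shown in the figure.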

Mixture of denoisers

Evaluation
Table 8: Summary of UL20B results compared to state-of-the-art. (l) denotes leaderboard submission. (♯) denotes the best published we could find on the leaderboard. (e) denotes SOTA used an ensembled approach. Because we evaluate finetuning and in-context trade-offs for SuperGLUE, SuperGLUE scores have their own dedicated section below.

Evaluation
Table 9: Results on SuperGLUE dev set. We compare with T5-11B (Raffel et al., 2019), ST-MoE-32B (Zoph et al., 2022) and PaLM-8B, PaLM-62B and PaLM-540B (Chowdhery et al., 2022). Scores reported are the peak validation scores per task.
Table 10: Results on zero-shot learning on SuperGLUE dataset. We compare with GPT-3, GLaM and PaLM (Chowdhery et al., 2022). We also include models that are relatively compute-matched with UL20B such as T5-XXL with LM adaptation (Lester et al., 2021), GPT-3 13B and GLaM-8B dense. Notably, UL20B outperforms GPT-3 175B and all other models in a similar compute class on average score.
Text Summarization with Pretrained Encoders (https://arxiv.org/abs/1908.08345) Table 4: ROUGE F1 results on the XSum test set. Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software.

4. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
https://arxiv.org/abs/2205.05055v2
DeepMind
Abstract: Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset. We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language -- burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards either few-shot learning vs. memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model -- a skewed, Zipfian distribution over classes -- which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models. In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.

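context window: The context window specifies how many words are used to determine the context of each word. For example, for the sentence "the quick brown fox" with a context window of 2, the samples are (the, quick) and (the, brown). Sliding one word forward, the samples become (quick, the), (quick, brown), (quick, fox), and so on. Reading a word2vec tutorial is recommended for understanding the training procedure and terminology. https://datascience.stackexchange.com/questions/16424/what-is-context-window-size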
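The sliding-window example above can be generated mechanically. The sketch below enumerates (center, context) pairs in the skip-gram style for a given window size; the function name `skipgram_pairs` is an illustrative choice and not part of any word2vec library.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


for pair in skipgram_pairs("the quick brown fox".split(), window=2):
    print(pair)
```

The first five pairs printed match the samples listed in the explanation: two for "the" and three for "quick".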

Omniglot (dataset): a character-image dataset that is somewhat harder than MNIST (https://github.com/brendenlake/omniglot)

Experimental method

Causal Transformer: Causal Transformer for Estimating Counterfactual Outcomes (https://arxiv.org/abs/2204.07258)

A consistent trade-off between few-shot learning and in-weights memorization

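Polysemy: the word "bank" can mean either a financial institution or a raised embankment (a river bank). By analogy with this phenomenon, training was done on a dataset in which each image can have multiple labels.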

A moderate amount of variation gives rise to burstiness

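Zipf's law: equivalent in meaning to the Pareto principle and power laws.
Zipf's law (https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87)
Zipf's law, power laws, the Pareto principle, and the long tail (https://yyhhyy.hatenablog.com/entry/2015/08/30/210000)
The distribution of words in natural language follows a Zipfian distribution.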
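A Zipfian distribution over classes assigns the class of rank k a probability proportional to 1/k^α. The sketch below builds such a distribution and samples class indices from it; the function name, the exponent value, and the class count are illustrative assumptions rather than the paper's experimental settings.

```python
import random


def zipf_probabilities(num_classes: int, alpha: float = 1.0) -> list:
    """Probability of each class rank k = 1..num_classes, proportional to 1 / k**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_classes + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_probabilities(num_classes=10, alpha=1.0)
print([round(p, 3) for p in probs])

# Draw a skewed sample of class indices: a few classes dominate, the rest form a long tail.
rng = random.Random(0)
sample = rng.choices(range(10), weights=probs, k=20)
print(sample)
```

Setting α = 0 gives a uniform distribution, and larger α makes the head classes dominate more strongly.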

A phenomenon specific to Transformers
Figure 7 | Few-shot learning in transformers vs recurrent architectures. We compare architectures while holding fixed the number of layers, hidden layer size, and number of parameters. Only a transformer is able to attain few-shot learning; the Vanilla RNN and LSTM never perform above chance. One run was performed for each set of hyperparameters in a hyperparameter sweep.

5. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning
https://arxiv.org/abs/2205.01491v1
Abstract: Deep learning has been achieving decent performance in computer vision requiring a large volume of images, however, collecting images is expensive and difficult in many scenarios. To alleviate this issue, many image augmentation algorithms have been proposed as effective and efficient strategies. Understanding current algorithms is essential to find suitable methods or develop novel techniques for given tasks. In this paper, we perform a comprehensive survey on image augmentation for deep learning with a novel informative taxonomy. To get the basic idea why we need image augmentation, we introduce the challenges in computer vision tasks and vicinity distribution. Then, the algorithms are split into three categories; model-free, model-based, and optimizing policy-based. The model-free category employs image processing methods while the model-based method leverages trainable image generation models. In contrast, the optimizing policy-based approach aims to find the optimal operations or their combinations. Furthermore, we discuss the current trend of common applications with two more active topics, leveraging different ways to understand image augmentation, such as group and kernel theory, and deploying image augmentation for unsupervised learning. Based on the analysis, we believe that our survey gives a better understanding helpful to choose suitable methods or design novel algorithms for practical applications.

6. CoCa: Contrastive Captioners are Image-Text Foundation Models
https://arxiv.org/abs/2205.01917v1
Google Research
Abstract: Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.

Recap of three training approaches: three families of foundation models that use natural language supervision
• Single-Encoder Classification
• Dual-Encoder Contrastive Learning
• Encoder-Decoder Captioning

Contrastive Captioners (CoCa)
Figure 1: Overview of Contrastive Captioners (CoCa) pretraining as image-text foundation models. The pretrained CoCa can be used for downstream tasks including visual recognition, vision-language alignment, image captioning and multimodal understanding with zero-shot transfer, frozen-feature evaluation or end-to-end finetuning.
Figure 2: Detailed illustration of CoCa architecture and training objectives.
λCon and λCap are loss-weighting hyperparameters.
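For reference, the way the two weights enter the training objective can be written as follows. This is a sketch in the notation of the CoCa paper as best recalled here: $x_i$ and $y_j$ are normalized image and text embeddings, $\sigma$ is the temperature, $N$ is the batch size, and $T$ is the caption length; none of these symbols appear on the slide itself.

```latex
\begin{align}
\mathcal{L}_{\mathrm{CoCa}} &= \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}}
  + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}} \\
\mathcal{L}_{\mathrm{Con}} &= -\frac{1}{N} \left(
  \sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)}
  + \sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)}
  \right) \\
\mathcal{L}_{\mathrm{Cap}} &= -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t}, x\right)
\end{align}
```

The contrastive term is applied to the unimodal embeddings and the captioning term to the multimodal decoder outputs, so both can be computed from one forward pass.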

Contrastive Captioners (CoCa): the target tasks are visual recognition, crossmodal alignment, image captioning, and multimodal understanding.

Cross-attention
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification (https://arxiv.org/abs/2103.14899v2)
Figure 2: An illustration of our proposed transformer architecture for learning multi-scale features with cross-attention (CrossViT). Our architecture consists of a stack of K multi-scale transformer encoders. Each multi-scale transformer encoder uses two different branches to process image tokens of different sizes (Ps and Pl, Ps < Pl) and fuse the tokens at the end by an efficient module based on cross attention of the CLS tokens. Our design includes different numbers of regular transformer encoders in the two branches (i.e. N and M) to balance computational costs.
Figure 3: Multi-scale fusion. (a) All-attention fusion where all tokens are bundled together without considering any characteristic of tokens. (b) Class token fusion, where only CLS tokens are fused as it can be considered as global representation of one branch. (c) Pairwise fusion, where tokens at the corresponding spatial locations are fused together and CLS are fused separately. (d) Cross-attention, where CLS token from one branch and patch tokens from another branch are fused together.

Cross-attention (application): handling multimodal inputs
https://arxiv.org/abs/2103.03206
Figure 1. The Perceiver is an architecture based on attentional principles that scales to high-dimensional inputs such as images, videos, audio, point-clouds, and multimodal combinations without making domain-specific assumptions. The Perceiver uses a cross-attention module to project an high-dimensional input byte array to a fixed-dimensional latent bottleneck (the number of input indices M is much larger than the number of latent indices N) before processing it using a deep stack of Transformer-style self-attention blocks in the latent space. The Perceiver iteratively attends to the input byte array by alternating cross-attention and latent self-attention blocks.
Figure 2. We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right). Essentially no architectural changes are required to use the model on a diverse range of input data.

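attentional-pooling: a technique for applying an attention mechanism to action recognition; it can also be viewed as weighted average pooling. It only requires replacing the average pooling layer at the end of a typical CNN. Attentional Pooling for Action Recognition https://arxiv.org/abs/1711.01467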
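The "weighted average pooling" view can be illustrated with a generic attention-weighted pooling over spatial feature vectors. This is a simplified sketch under that interpretation only: it is not the low-rank formulation from the Attentional Pooling paper, and the function name and toy inputs are assumptions.

```python
import math


def attention_pool(features, scores):
    """Pool feature vectors with softmax(scores) as weights.

    features: list of equal-length vectors, one per spatial location.
    scores:   one unnormalized attention score per location.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


features = [[1.0, 2.0], [3.0, 4.0]]
print(attention_pool(features, scores=[0.0, 0.0]))   # equal scores: plain average pooling
print(attention_pool(features, scores=[0.0, 10.0]))  # second location dominates
```

With equal scores the result reduces to ordinary average pooling, which is why the layer can act as a drop-in replacement for it.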

Experimental results (3-second videos, 10-second videos): all transfer learning started from a single pretrained checkpoint.

Experimental results: fewer parameters are required.

Experimental results

7. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities
https://arxiv.org/abs/2205.06743v1
Abstract: Few-shot learning (FSL) has emerged as an effective learning method and shows great potential. Despite the recent creative works in tackling FSL tasks, learning valid information rapidly from just a few or even zero samples still remains a serious challenge. In this context, we extensively investigated 200+ latest papers on FSL published in the past three years, aiming to present a timely and comprehensive overview of the most recent advances in FSL along with impartial comparisons of the strengths and weaknesses of the existing works. For the sake of avoiding conceptual confusion, we first elaborate and compare a set of similar concepts including few-shot learning, transfer learning, and meta-learning. Furthermore, we propose a novel taxonomy to classify the existing work according to the level of abstraction of knowledge in accordance with the challenges of FSL. To enrich this survey, in each subsection we provide in-depth analysis and insightful discussion about recent advances on these topics. Moreover, taking computer vision as an example, we highlight the important application of FSL, covering various research hotspots. Finally, we conclude the survey with unique insights into the technology evolution trends together with potential future research opportunities in the hope of providing guidance to follow-up research.

8. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
https://arxiv.org/abs/2204.14217v1
Tsinghua University, BAAI
Abstract: The development of the transformer-based text-to-image models are impeded by its slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.
https://github.com/THUDM/CogView2

Non-autoregressive generation

Ice Tokenizer: can embed Chinese, English, and images in the same space https://github.com/THUDM/icetk

Figure 4: Super-resolution modules. The low-resolution images are mapped into high-resolution images via the direct super-resolution module. In each snapshot during the iterative super-resolution, the tokens in the same color are generated at the same time. All the local windows work in parallel. Hierarchical generation

Table 1: Machine Evaluation Results on MS-COCO. (Downsampling CogView2 images to 256×256.) * means finetuning on MS-COCO. Experimental results and advantages: for FID (Fréchet Inception Distance) and IS (Inception Score), see http://www.cv.info.gifu-u.ac.jp/contents/workshop/contents/eccv2018/ppt/how_good_is_my_gan.pdf CogView2 was trained on less training data than DALL-E-2. It can also reduce time cost compared with diffusion models. (Because images can be upsampled with a high degree of parallelism, models much faster than diffusion models can (potentially) be designed by introducing more levels of hierarchy.)
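As a rough illustration of the FID metric referenced above, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to feature vectors. It assumes NumPy is available and that features have already been extracted; in the real metric they are Inception network activations, which are omitted here. The function name `frechet_distance` and the random toy features are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features). Real FID uses Inception
    activations as features; any feature matrix is accepted here (assumption).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (tiny negative or imaginary parts are numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Toy stand-ins for image features (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 4))
fake_close = rng.normal(0.1, 1.0, size=(500, 4))
fake_far = rng.normal(2.0, 1.0, size=(500, 4))
fid_close = frechet_distance(real, fake_close)
fid_far = frechet_distance(real, fake_far)
print(fid_close, fid_far)
```

Lower values mean the two feature distributions are closer, so `fid_close` should come out much smaller than `fid_far`.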

https://arxiv.org/abs/2205.01128v1 Abstract: What explains the dramatic progress from 20th-century to 21st-century AI, and how can the remaining limitations of current AI be overcome? The widely accepted narrative attributes this progress to massive increases in the quantity of computational and data resources available to support statistical learning in deep artificial neural networks. We show that an additional crucial factor is the development of a new type of computation. Neurocompositional computing adopts two principles that must be simultaneously respected to enable human-level cognition: the principles of Compositionality and Continuity. These have seemed irreconcilable until the recent mathematical discovery that compositionality can be realized not only through discrete methods of symbolic computing, but also through novel forms of continuous neural computing. The revolutionary recent progress in AI has resulted from the use of limited forms of neurocompositional computing. New, deeper forms of neurocompositional computing create AI systems that are more robust, accurate, and comprehensible. Johns Hopkins University, Microsoft Research, Northwestern University 9. Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

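Compositionality: The principle of compositionality states that if the lexical parts are taken out of a meaningful sentence, what remains is the rules of composition. Take, for example, the sentence "Socrates was a man". Once the meaningful lexical items "Socrates" and "man" are taken away, what is left is the pseudo-sentence "S was M". The task is to describe what connection holds between S and M. In semantics, mathematical logic, and related disciplines, the principle of compositionality is the principle that the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them.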

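The Central Paradox of Cognition
• Example of recent neural computing, which respects the Continuity principle.
  ◦ Neural computing: information is encoded in numerical activation vectors
• a: Information is encoded in numerical activation vectors (patterns of activation over a group or layer of neurons)
• b: For example, a standard neural system can readily generalize knowledge about the word "lock" to the word "fasten"
• c: The structure of the vector space can also capture other types of relations, for example through consistent offsets that represent systematic differences between pairs of sentences
• Example of traditional computing, which respects the Compositionality principle.
  ◦ Compositionality means that the result for the whole is obtained by composing the results for the parts, so a problem can be simplified and then solved. This saves resources and makes it possible to handle novel situations.
• c: For example, knowing that if it rains I drive the car (q→r), and that if I drive I need to charge the car (r→s), we can conclude that if it rains I need to charge the car (q→s).
As the contrast between Figure 1 and Figure 2 shows, discrete symbolic computers that process compositional structure and continuous neural computers are very different. Yet somehow the computer in our heads seems to be both a neural computer and a compositional-structure computer. How can this be? We call this the "Central Paradox of Cognition".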
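The q→r, r→s example above can be sketched in a few lines of symbolic code. This is only an illustrative forward-chaining toy: the function name `chain_rules` and the encoding of rules as pairs are assumptions made for this example, not something taken from the paper.

```python
from typing import Set, Tuple

Rule = Tuple[str, str]  # (a, b) is read as "a -> b"


def chain_rules(rules: Set[Rule]) -> Set[Rule]:
    """Close a set of implications under chaining (a -> b, b -> d gives a -> d)."""
    closed = set(rules)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # From a -> b and b -> d, conclude a -> d.
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


# q: it rains, r: I drive the car, s: the car needs charging
known = {("q", "r"), ("r", "s")}
derived = chain_rules(known)
print(("q", "s") in derived)  # True
```

The conclusion q→s is produced purely by combining the parts, which is the kind of inference the slide attributes to compositional computing.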

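• Neural computing, strengths: learns continuous encodings optimal for the task from task data; generalizes on the basis of similarity; exploits statistical patterns in the data
• Symbolic computing, strengths: good compositional generalization from explicit compositional representations in formal domains; comprehensibility
• Neural computing, weaknesses: unreliable compositional generalization; incomprehensibility
• Symbolic computing, weaknesses: in informal domains, human-designed discrete structures and rules often fit the data poorly; requires an intractable search over an exponentially large space of candidate compositional structures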

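n-in-n task: The deeper the implementation of neurocompositionality, the more robust compositional generalization becomes. For example, consider the very simple task of taking a sequence of five digits such as ⟨3,9,7,4,7⟩ as input, encoding the whole sequence internally, and reproducing the sequence as output. This copy task requires compositional generalization, because the model must learn that if the digit 4 appears in position n of the input, then 4 must be placed in position n of the output, for every possible position n. Likewise, it must learn that if the digit n appears in position 4 of the input, then n must be placed in position 4 of the output, for every possible digit n. To test this kind of compositional generalization, one class of sequences is withheld from training: all "n-in-n" sequences, that is, those with a 1 in position 1, or a 2 in position 2, and so on. Because the digit n never appears in position n during training, correctly copying an n-in-n sequence after training requires compositional generalization.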
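The train/test split described above can be sketched as follows. This is a minimal illustration, assuming digits 0-9 and positions counted from 1; the paper's exact digit range and data sizes are not given in these slides, and the helper name `is_n_in_n` is hypothetical.

```python
import itertools


def is_n_in_n(seq: tuple) -> bool:
    """True if some digit n sits in position n (positions counted from 1)."""
    return any(digit == pos for pos, digit in enumerate(seq, start=1))


# All 5-digit sequences over the digits 0-9 (assumed range): 100,000 in total.
all_seqs = list(itertools.product(range(10), repeat=5))

# Training data withholds every n-in-n sequence; the test set consists of them.
train = [s for s in all_seqs if not is_n_in_n(s)]
test = [s for s in all_seqs if is_n_in_n(s)]

print(len(train), len(test))
print(is_n_in_n((3, 9, 7, 4, 7)))  # True: the digit 4 is in position 4
```

Note that the slide's own example ⟨3,9,7,4,7⟩ is an n-in-n sequence under this definition, so it would be held out for testing.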

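neurocompositional computing https://www.microsoft.com/en-us/research/uploads/prod/2022/04/Neurocompositional_computing_tutorial__MSR_TR_submitted.pdf
• Recent progress in AI is due not only to quantitative technical advances but also to the unrecognized emergence of neurocompositional computing
• Current AI faces three problems: limited compositional generalization, extreme data inefficiency in learning, and interpretability.
• Add the "Compositionality principle", the foundation of traditional AI systems, to the "Continuity principle", the foundation of modern AI systems.
  ◦ Encodings and the operations on them respect the Continuity principle in that they are formalized with real numbers.
  ◦ But the Compositionality principle is also important in human cognition
    ▪ It is the key to tackling the challenges of compositional generalization and data inefficiency
• The Transformer architecture is first-generation (1G) neurocompositional computing.
  ◦ A combination of sequential structure and network structure (graph)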

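NECST (Neurally-Encoded Compositionally-Structured Tensor): tensor product representation (TPR) of compositional structure. Fillers and roles are encoded as continuous vectors by a neural network, and their tensor product is computed. The neural encoding obtained in this way is called a tensor product representation (TPR) of the compositional structure. This is why the approach is called Neurally-Encoded Compositionally-Structured Tensor (NECST) computing. Filler-role binding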
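Filler-role binding with tensor products can be sketched as below. This is a minimal illustration, assuming NumPy: the filler vectors are random stand-ins and the role vectors are a random orthonormal basis, whereas in NECST computing both would be learned by a neural network. The names `bind` and `unbind` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fillers (symbols) with random continuous embeddings.
fillers = {name: rng.normal(size=4) for name in ["3", "9", "7"]}

# Orthonormal role vectors, one per position, so that unbinding is exact.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
roles = q.T  # rows are orthonormal


def bind(sequence):
    """TPR of a sequence: sum over positions of outer(filler, role)."""
    return sum(np.outer(fillers[sym], roles[pos]) for pos, sym in enumerate(sequence))


def unbind(tpr, pos):
    """Recover the filler bound to role `pos` (valid for orthonormal roles)."""
    return tpr @ roles[pos]


tpr = bind(["3", "9", "7"])
recovered = unbind(tpr, 1)
print(np.allclose(recovered, fillers["9"]))  # True
```

The whole sequence lives in a single continuous tensor, yet each constituent can still be read back out, which is the sense in which the encoding is both continuous and compositional.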

10. Building Machine Translation Systems for the Next Thousand Languages https://arxiv.org/abs/2205.03983v2 Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings. Google Research. Uses "high-resource languages" to tackle translation tasks for "long-tail languages"

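An N-gram is a co-occurrence relation (collocation) among linguistic units in a text (characters, morphemes, parts of speech, and so on) in which two units, three units, or in general N units occur adjacently (called 2-grams, 3-grams, and N-grams respectively); it can be regarded as capturing one aspect of a document's characteristics. https://www.isc.meiji.ac.jp/~mizutani/mining/n_gram.html#:~:text=N%2Dgram%E3%81%A8%E3%81%AF%E3%80%81%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88,%E3%81%A8%E8%80%83%E3%81%88%E3%82%8B%E3%81%93%E3%81%A8%E3%81%8C%E3%81%A7%E3%81%8D%E3%82%8B%E3%80%82
• Because the frequencies of co-occurrence relations follow a power-law distribution, misclassification is likely when the model encounters a language that shares the same relations.
• When co-occurrence relations are repeated arbitrarily, the class probability is inflated.
• Space-separated text is common on the internet and acts as noise.
n-gram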
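The N-gram definition above can be made concrete with a short sketch that counts character n-grams. The function name `char_ngrams` and the sample strings are hypothetical; this is not the language-identification model from the paper, only the counting step it builds on.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count all contiguous character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


bigrams = char_ngrams("banana", 2)
print(bigrams)

# Repeating a pattern inflates the counts of a few n-grams, which is how
# arbitrary repetition can push up a class probability in an n-gram model.
repeated = char_ngrams("hahahaha", 2)
print(repeated)
```

In "banana" the bigrams "an" and "na" each occur twice and "ba" once; in the repeated string only two distinct bigrams account for every count.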

Filtering

https://arxiv.org/pdf/2010.14571.pdf? Filtering

Filtering: improves TF-IIF filtering

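Model: develops practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages, together with monolingual data for an additional 1000+ languages
• A co-training mechanism that combines supervised multilingual NMT with monolingual data and self-supervised learning
  ◦ Supervised multilingual NMT → Transformer-based MT model
  ◦ Monolingual data and self-supervised learning → Masked Sequence-to-Sequence (MASS)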
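The MASS objective named above can be illustrated with a small sketch that builds one training pair from a monolingual sentence. This is a simplified view: the `[MASK]` symbol, the function name `mass_example`, and the fixed span position are assumptions for illustration, and details such as the decoder-side input and random span selection are left out.

```python
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol


def mass_example(tokens: List[str], start: int, length: int) -> Tuple[List[str], List[str]]:
    """Build one MASS-style training pair from a monolingual sentence.

    The encoder sees the sentence with a contiguous span masked out;
    the decoder is trained to generate exactly that span.
    """
    end = start + length
    encoder_input = tokens[:start] + [MASK] * length + tokens[end:]
    decoder_target = tokens[start:end]
    return encoder_input, decoder_target


sentence = "the cat sat on the mat".split()
enc, tgt = mass_example(sentence, start=2, length=3)
print(enc)  # ['the', 'cat', '[MASK]', '[MASK]', '[MASK]', 'mat']
print(tgt)  # ['sat', 'on', 'the']
```

Because the pair is built from a single sentence, no parallel data is needed, which is what makes the objective usable for languages that only have monolingual text.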

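Evaluation: The models are first analyzed with more quantitative methods, such as the performance of CHRF versus BLEU (4.2) and human evaluation (4.4). RTTLANGIDCHRF is a reference-free metric developed for low-resource languages, and it correlates moderately with CHRF. Next comes a qualitative analysis of model outputs, which highlights several error patterns: confusion of distributionally similar words and concepts, such as "tiger" and "miniature crocodile" (4.5), errors on single-word inputs (4.7), and an examination of how error modes are magnified in distilled models (4.8).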
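To give a feel for what CHRF measures, here is a simplified sketch of a character n-gram F-score. It averages character n-gram precision and recall over orders 1 to 6 and combines them with beta = 2; the function names are hypothetical, and details of real implementations (whitespace handling, word n-grams, corpus-level aggregation) are not reproduced, so scores will not match a reference tool exactly.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count contiguous character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram precision and recall,
    each averaged over the n-gram orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


score_same = simple_chrf("the cat", "the cat")
score_near = simple_chrf("the cat", "the hat")
score_none = simple_chrf("abc", "xyz")
print(score_same, score_near, score_none)
```

Working at the character level gives partial credit for near-miss word forms, which is one reason such metrics are attractive for morphologically rich or low-resource languages.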



Transcript