first <EOS> symbol, and the decoder (right) begins with <SOS>. 200 hidden units per layer, no attention, and dropout applied at the 0.5 level. Although the detailed analyses to follow focus on this particular model, the top-performing architecture for each experiment individually is also reported and analyzed.

Networks were trained with the following specifications. Training consisted of 100,000 trials, each presenting an input/output sequence and then updating the network's weights.[5] The ADAM optimization algorithm was used with default parameters, including a learning rate of 0.001 (Kingma & Welling, 2014). Gradients with a norm larger than 5.0 were clipped. Finally, the decoder requires the previous step's output as the next step's input, which was computed in two different ways during training: half the time, the network's self-produced outputs were passed back to the next step, and the other half of the time, the ground-truth outputs were passed back (teacher forcing; Williams & Zipser, 1989). The networks were implemented in PyTorch and based on a standard seq2seq implementation.[6] Training accuracy was above 99.5% for the overall-best network in each of the key experiments, and it was at least 95% for the top-performers in each experiment specifically.

[5] Note that, in all experiments, the number of distinct training commands is well below 100k: we randomly sampled them with replacement to reach the target size.
[6] The code we used is publicly available at the link: seq2seq_translation_tutorial.html

Experiment 1: Generalizing to a random subset of commands

In this experiment, the SCAN tasks were randomly split into a training set (80%) and a test set (20%). The training set provides broad coverage of the task space, and the test set examines how networks can decompose and recombine commands from the training set.
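The split and training regime described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names (`split_commands`, `sample_trials`, `clip_by_norm`) are hypothetical, the teacher-forcing choice is assumed to be a per-trial coin flip with probability 0.5, and "clipped" is assumed to mean rescaling the gradient vector so that its norm equals the threshold.

```python
import math
import random


def split_commands(pairs, train_fraction=0.8, seed=0):
    """Randomly split (command, actions) pairs into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def sample_trials(train_pairs, n_trials=100_000, seed=0):
    """Yield training trials drawn with replacement from the training set.

    Each trial is tagged with a flag saying whether to feed ground-truth
    outputs back into the decoder (teacher forcing). The per-trial coin
    flip is an assumption; the text only says "half the time".
    """
    rng = random.Random(seed)
    for _ in range(n_trials):
        pair = rng.choice(train_pairs)
        use_teacher_forcing = rng.random() < 0.5
        yield pair, use_teacher_forcing


def clip_by_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradient values if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)
```

In a PyTorch training loop, the clipping step would normally be delegated to the library's own gradient-norm utility rather than written by hand; the pure-Python version here only makes the arithmetic explicit.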
For instance, the network is asked to perform the new command, “jump opposite right after walk around right thrice,” as a zero-shot generalization in the test set. Although the conjunction as a whole is novel, the parts are not: The training set features many examples of the parts in other contexts, e.g., “jump opposite right after turn opposite right” and “jump right twice after walk around right thrice” (the shared sub-strings “jump opposite right” and “walk around right thrice” both appear 83 times in the training set). To succeed, the network needs to generalize by recombining pieces of existing commands to interpret new ones.

Overall, the networks were highly successful at generalization. The top-performing network for this experiment achieved 99.8% correct on the test set (accuracy values here and below are averaged over the five training runs). The top-performing architecture was an LSTM with no attention, 2 layers of 200 hidden units, and no dropout. The best-overall network achieved 99.7% correct. Interestingly, not every architecture was successful: Classic SRNs performed very poorly, and the best SRN achieved less than 1.5% correct at test time (performance on the training set was equally low). However, attention-augmented SRNs learned the commands much better, achieving 59.7% correct on average for the test set (with a range between 18.4% and 94.0% across SRN

Generalization without Systematicity

jump ⇒ JUMP
jump left ⇒ LTURN JUMP
jump around right ⇒ RTURN JUMP RTURN JUMP RTURN JUMP RTURN JUMP
turn left twice ⇒ LTURN LTURN
jump thrice ⇒ JUMP JUMP JUMP
jump opposite left and walk thrice ⇒ LTURN LTURN JUMP WALK WALK WALK
jump opposite left after walk around left ⇒ LTURN WALK LTURN WALK LTURN WALK LTURN WALK LTURN LTURN JUMP

Figure 1. Examples of SCAN commands (left) and the corresponding action sequences (right).

Figure 2. The seq2seq framework is applied to SCAN. The symbols <EOS> and <SOS> denote end-of-sentence and start-of-sentence, respectively.
The encoder (left) ends with the first <EOS> symbol, and the decoder (right) begins with <SOS>.

4. Experiments

In each of the following experiments, the recurrent networks are trained on a large set of commands from the SCAN tasks to establish background knowledge as outlined above. After training, the networks are then evaluated on new commands designed to test generalization beyond the background set in systematic, compositional ways. In evaluating these new commands, the networks must make zero-shot generalizations and produce the appropriate action sequence based solely on extrapolation from the background training.

            jump / SCAN    jump / NACS    right / SCAN   right / NACS
seq2seq     0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
+ GECA      0.87 ± 0.02    0.67 ± 0.01    0.82 ± 0.04    0.82 ± 0.03

Table 1: Sequence match accuracies on SCAN datasets, in which the learner must generalize to new compositional uses of a single lexical item (“jump”) or multi-word modifier (“around right”) when mapping instructions to action sequences (SCAN) or vice-versa (NACS, Bastings et al., 2018). While the sequence-to-sequence model is unable to make any correct generalizations at all, applying GECA enables it to succeed most of the time. Scores are averaged over 10 random seeds; the standard deviation across seeds is shown. All improvements are significant (paired binomial test, p ≪ 0.001).
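The mapping from SCAN commands to action sequences, as illustrated by the examples in Figure 1, can be sketched as a small interpreter. This is a minimal sketch rather than the official SCAN grammar: the names `interpret` and `interpret_phrase` are hypothetical, the primitives `look` and `run` are assumed to behave like `walk` and `jump`, and a command is assumed to contain at most one conjunction (`and` or `after`).

```python
PRIMITIVES = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP"}
TURNS = {"left": "LTURN", "right": "RTURN"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_phrase(words):
    """Interpret a phrase with no conjunction, e.g. ['jump', 'around', 'right']."""
    words = list(words)
    repeat = 1
    if words and words[-1] in REPEATS:
        repeat = REPEATS[words.pop()]
    verb, rest = words[0], words[1:]
    # "turn" contributes no action of its own, only the turns that follow it.
    action = [] if verb == "turn" else [PRIMITIVES[verb]]
    if not rest:
        actions = action
    elif len(rest) == 1:
        actions = [TURNS[rest[0]]] + action
    elif len(rest) == 2 and rest[0] == "opposite":
        actions = [TURNS[rest[1]]] * 2 + action
    elif len(rest) == 2 and rest[0] == "around":
        actions = ([TURNS[rest[1]]] + action) * 4
    else:
        raise ValueError("unrecognized phrase: " + " ".join(words))
    return actions * repeat


def interpret(command):
    """Map a SCAN command string to its list of actions."""
    words = command.split()
    for conj in ("and", "after"):
        if conj in words:
            i = words.index(conj)
            first = interpret_phrase(words[:i])
            second = interpret_phrase(words[i + 1:])
            # "x after y" means y is executed first, then x.
            return first + second if conj == "and" else second + first
    return interpret_phrase(words)


print(" ".join(interpret("jump opposite left after walk around left")))
```

The interpreter is purely compositional: each modifier and conjunction is applied the same way regardless of which primitive it combines with, which is the kind of systematic behavior the experiments ask the networks to extrapolate.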