[Paper Introduction] From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

2025/06/27
Paper introduction @ TanichuLab
https://sites.google.com/view/tanichu-lab-ku/


omoto haruumi


Transcript

  1. From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
     SES Lab's Journal Club, Haruumi Omoto, posted on 06/27/2025
  2. Paper Information
     • Title: From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
     • Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
     • Date: Submitted on 17 June 2025
     • arXiv: https://arxiv.org/abs/2506.14761v1
  3. Limitations of Fixed Tokenization
     • Fixes the granularity at which the language model operates and how far ahead it predicts
     • Each token is represented as an independent vector, an "opaque identifier" whose internal structure the model cannot access
     • Complicates transferring knowledge to dialects and low-resource languages
     [Figure: the problem with fixed vs. adaptive tokenization]
  4. Solution: Autoregressive U-Net architecture (AU-Net)
     ① Learns representations dynamically from raw bytes, avoiding a predefined vocabulary
     ② Processes information hierarchically: deeper stages handle broad semantics, while shallower stages focus on fine details like spelling
     ③ Uses skip connections to blend high-level information with fine-grained details, enabling more accurate predictions (a conceptual sketch follows below)
     [Figure: AU-Net over a raw byte sequence, with ① raw bytes, ② detail vs. broad stages, and ③ skip connections]
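     The three points above can be read as a contract-process-expand loop over the byte sequence. The Python sketch below is my own minimal illustration of that flow, not the authors' implementation; the shallow/deep stage stand-ins, the boundary format, and the simple additive skip are all assumptions.

        import numpy as np

        def pool(x, boundaries):
            # contract: keep only the vectors at split boundaries (e.g. word ends)
            return x[boundaries]

        def upsample(coarse, boundaries, length):
            # expand: copy each coarse vector back over the finer positions it covers
            out = np.zeros((length, coarse.shape[1]))
            starts = [0] + [b + 1 for b in boundaries[:-1]]
            for vec, start, end in zip(coarse, starts, boundaries):
                out[start:end + 1] = vec
            return out

        def au_net_step(byte_vecs, boundaries, shallow_stage, deep_stage):
            fine = shallow_stage(byte_vecs)                # ② shallow stage: spelling-level detail
            coarse = deep_stage(pool(fine, boundaries))    # ② deep stage: broader semantics
            expanded = upsample(coarse, boundaries, len(byte_vecs))
            return fine + expanded                         # ③ skip connection blends both

        # With identity stand-ins for the stages, this runs end to end:
        out = au_net_step(np.eye(7), [2, 6], lambda x: x, lambda x: x)
        print(out.shape)   # (7, 7)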
  5. Contributions
     1. Adaptive multi-level hierarchy ➢ End-to-end learned embedding stages with arbitrary splitting functions
     2. Infinite vocabulary size ➢ Avoids predefined vocabularies
     3. Strong performance and scaling ➢ Matches strong BPE baselines with promising scaling trends
     4. Practical efficiency ➢ Maintains comparable training speed, with code publicly available
     5. Stable scaling laws ➢ Proposes new scaling laws for stable and smooth optimization
  6. Background: Limitations of Fixed Tokenization
     • Tokens are independent vectors, so the model cannot see shared patterns
     [Figure: BPE splits "strawberry" and "strawberries" into subword token IDs that the language model treats as unrelated opaque identifiers, while AU-Net reads the raw bytes "s t r a w b e r r y" / "s t r a w b e r r i e s", easily captures the common substring, and learns the semantic similarity without assistance]
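     As a toy illustration of this point (my own example, not from the paper): at the byte level the two word forms visibly share a long common prefix, which a byte-level model can exploit directly.

        a = "strawberry".encode("utf-8")
        b = "strawberries".encode("utf-8")

        # length of the shared byte prefix
        shared = 0
        while shared < min(len(a), len(b)) and a[shared] == b[shared]:
            shared += 1

        print(a, b)      # b'strawberry' b'strawberries'
        print(shared)    # 9 -> both begin with the bytes of "strawberr"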
  7. Other Limitations of Fixed Tokenization (from my research)
     • Problems in multilingual settings [1]
     • Unable to perform task-optimized tokenization [2]
     • Not robust to typos, spelling variations, or morphological changes [3]
     • A barrier to distilling knowledge between different models
     • Cannot pre-define tokenization for emergent languages
     [1] Xue, Linting, et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." Transactions of the Association for Computational Linguistics 10 (2022): 291-306.
     [2] Zheng, Mengyu, et al. "Enhancing large language models through adaptive tokenizers." Advances in Neural Information Processing Systems 37 (2024): 113545-113568.
     [3] Wang, Junxiong, et al. "MambaByte: Token-free selective state space model." arXiv preprint arXiv:2401.13660 (2024).
  8. What is U-Net (from my research)
     • Encoder-decoder architecture whose skip connections combine local (fine) and global (coarse) information
     [Figure: U-Net encoder-decoder with skip connections, from [2]]
     [2] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Part III. Springer, 2015.
  9. Proposed Method
     • Hierarchical processing enables guidance at various levels of abstraction (e.g. byte → one word → two words); a splitting sketch follows below
     • Starting from the byte level allows for an unlimited vocabulary
     • Contraction reduces computational cost
     [Figure: Stage 1: byte level, Stage 2: word level, Stage 3: two-word level]
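     A rough sketch of how such stage-wise split points could be chosen over a byte string. This is my simplified guess based on the space-based rule described on the next slide; the paper's actual regular expressions may differ.

        import re

        text = b"the cat sat on the mat"

        def word_end_positions(data: bytes):
            # index of the last byte of each whitespace-delimited word
            return [m.end() - 1 for m in re.finditer(rb"\S+", data)]

        stage1 = list(range(len(text)))      # Stage 1, byte level: every position
        stage2 = word_end_positions(text)    # Stage 2, word level: pool at word ends
        stage3 = stage2[1::2]                # Stage 3, two-word level: every second word end

        print(stage2)   # [2, 6, 10, 13, 17, 21]
        print(stage3)   # [6, 13, 21]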
  10. Details of the Proposed Method
     1. Pooling and Upsampling (sketch after this slide)
        ➢ Pooling: selects indices from the splitting function and projects them linearly
        ➢ Upsampling: duplicates coarse vectors to match finer segments, applying position-specific linear transforms called Multi-Linear Upsampling
     2. Splitting Function
        ➢ Supports flexible splitting strategies to define pooling points at each hierarchical stage
        ➢ In this paper, splits on spaces using different regular expressions at each stage
     3. Evaluating on Different Scales
        ➢ Model size is defined by FLOPs per input unit rather than the number of parameters
        ➢ FLOPs allow models like the BPE baseline and AU-Net to be compared on the same computational scale
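     A minimal sketch of how the pooling and Multi-Linear Upsampling in item 1 could look, assuming each coarse vector is expanded with a different matrix per offset within its segment; the dimensions, the per-offset cap, and the random initialization are my own placeholders, not the paper's settings.

        import numpy as np

        d, max_offset = 8, 4
        rng = np.random.default_rng(0)
        W_pool = rng.normal(size=(d, d))             # linear projection after index selection
        W_up = rng.normal(size=(max_offset, d, d))   # one matrix per position within a segment

        def pool(x, split_positions):
            # select the vectors at the split positions, then project them linearly
            return x[split_positions] @ W_pool

        def multi_linear_upsample(coarse, split_positions, length):
            # duplicate each coarse vector over its segment, transforming each copy
            # with the matrix chosen by its offset inside the segment
            out = np.zeros((length, d))
            start = 0
            for vec, end in zip(coarse, split_positions):
                for pos in range(start, end + 1):
                    k = min(pos - start, max_offset - 1)
                    out[pos] = vec @ W_up[k]
                start = end + 1
            return out

        x = rng.normal(size=(10, d))                 # fine-level activations
        splits = [3, 6, 9]                           # from the splitting function
        coarse = pool(x, splits)
        restored = multi_linear_upsample(coarse, splits, len(x))
        print(coarse.shape, restored.shape)          # (3, 8) (10, 8)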
  11. Experiment Settings
     ⚫ Purpose
        ➢ Evaluate the effectiveness of the proposed Autoregressive U-Net (AU-Net)
     ⚫ Dataset
        ➢ DCLM dataset [3] (predominantly English, with a focus on natural language understanding)
     ⚫ Baselines
        ➢ A Transformer model using the LLaMa 3 BPE tokenizer
        ➢ A Transformer model trained directly on raw bytes
        ➢ A Mamba model trained directly on raw bytes
     [3] Li, Jeffrey, et al. "DataComp-LM: In search of the next generation of training sets for language models." Advances in Neural Information Processing Systems 37 (2024): 14200-14282.
  12. Equal Data Budget Results
     • Hierarchical AU-Net models consistently matched or outperformed the BPE baseline
     • Multi-stage models, such as AU-Net 3 and AU-Net 4, showed particularly strong performance
  13. Scaling Laws
     • Proof of effectiveness
        ➢ AU-Net proves its viability by matching the performance of a strong, optimized BPE baseline
     • Future potential
        ➢ While it lags on some tasks (GSM8K, MMLU), its delayed performance "take-off" suggests it could surpass BPE at larger scales
  14. Transfer to Low-Resource Languages
     • AU-Net demonstrates strong performance on multilingual benchmarks
     • AU-Net captures shared spelling and morphological patterns across related languages
     • Improves translation from low-resource languages
  15. Ability to Manipulate Both Words and Characters
     • The experiments highlight a natural trade-off between the models
     • AU-Net performs better on character-manipulation tasks, such as spelling
     • The BPE baseline is stronger on word-level tasks
  16. Conclusion
     • Conclusion
        ➢ Introduced AU-Net, an autoregressive U-Net that learns hierarchical token representations from raw bytes
        ➢ AU-Net eliminates the need for predefined vocabularies
        ➢ AU-Net matches the performance of strong BPE baselines under controlled compute budgets
     • Limitations & Future Work
        ➢ AU-Net currently does not support languages that are not space-delimited (e.g. Chinese); a potential solution is to learn the splitting function directly
        ➢ As the number of stages increases, the efficiency of the parallelization framework becomes a challenge