Geometric Modeling of Crystal Structures using Transformers

Materials are everywhere around us—serving as semiconductors in computers, souvenir magnets on refrigerators, and lithium-ion battery components in smartphones. At the atomic level, these materials are made of atoms arranged in three-dimensional space with a remarkable property: periodicity.
For decades, materials scientists have explored how such structures relate to material properties, using both experiments and simulations. Today, machine learning is bringing new perspectives to this challenge.
In this talk, I’ll introduce our recent work presented at ICLR 2024 and 2025 on transformer-based geometric encoders for crystal structures. These models aim to predict material properties by learning from the 3D periodic geometry of crystals. I’ll also touch on broader applications of these crystal encoders beyond property prediction, as explored in our group.

Tatsunori Taniai

June 05, 2025
Transcript

  1. 1 Geometric Modeling of Crystal Structures using Transformers Tatsunori Taniai

    Senior Researcher OMRON SINIC X Corporation Seminar at RIKEN Center for AIP June 5th, 2025, RIKEN AIP Nihonbashi Office 15:00 – 16:00
  2. 2 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo

    (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X Stereo & flow & motion seg [CVPR 17] Semantic correspondence [CVPR 16] Stereo depth estimation [CVPR 14, TPAMI 18] Binary MRF optimization [BMVC 12, CVPR 15] My PhD thesis was about discrete optimization for low-level computer vision tasks such as stereo and segmentation, without any “learning”
  3. 3 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo

    (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X In the deep learning era, I have been seeking methodologies for integrating physical or algorithmic principles into deep learning-based methods. Physics-based self-supervised learning [Taniai & Maehara, ICML 18] Neural A* for learning path planning problems [Yonetani & Taniai+, ICML 21] Transformer encoders for crystals [Taniai+, ICLR 24] [Ito & Taniai+, ICLR 25]
  4. 4 Substances as 3D point clouds of atoms

     Substances are atoms forming stable structures in 3D space. • Molecules: 3D structures of up to 100s of atoms. • Proteins: huge molecules with 1k to 10k atoms, coded as 1D amino-acid sequences. • Crystals: an infinite number of atoms with periodicity; the focus of this talk.
  5. 5 Transformer encoders for understanding contexts

     Example: "He runs a company." In isolation, the tokens are only word classes (Pronoun, Verb or noun, Article, Noun, Period), and "runs" is ambiguous (jog or manage?). Through self-attention with 1D sequential positions, the tokens resolve into context-dependent roles: Subject, Verb (manage), Article, Object (a business organization), EOS. Self-attention can estimate context-dependent meanings of words in text.
  6. 6 Transformer encoders for understanding contexts

     Example: "He runs a dog." Starting from the same ambiguous word-class tokens, self-attention with 1D sequential positions now resolves "runs" to a verb meaning train/exercise and "dog" to an object (a pet animal). Self-attention can estimate context-dependent meanings of words in text.
  7. 7 Transformer encoders for analyzing substance structures

     Structure of the H2O molecule: two hydrogen atoms bonded to one oxygen atom. [3D model "H2O Molecule" (https://skfb.ly/6QWvZ) by Mehdi Mirzaie, licensed under Creative Commons Attribution 4.0 (http://creativecommons.org/licenses/by/4.0/).]
  8. 8 Transformer encoders for analyzing substance structures

     Use self-attention to evolve atomic states in a given spatial configuration: each atomic token starts as an abstract state of H or O in H2O, and self-attention with 3D spatial positions evolves these tokens into task-related states of H and O in H2O.
  9. 9 Geometric deep learning for materials science

     • Property prediction: high-throughput screening of materials; basic benchmark tasks for material encoders; needs invariant encoders. • Structure prediction: generate novel structures (e.g., inverse design); find stabler structures; predict chemical reactions; needs equivariant decoders. • Foundation models: predict high-level functionalities of materials; map the materials space; material encoders as an interface to multimodal FMs. Today's main topic is property prediction; our recent results on the other directions are briefly introduced at the end.
  10. 10 Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding Tatsunori

    Taniai OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yuta Suzuki Toyota Motor Corporation Naoya Chiba Tohoku University Kotaro Saito Randeft Inc. Osaka University Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Twelfth International Conference on Learning Representations May 7th through 11th, 2024 at Messe Wien Exhibition and Congress Center Vienna, Austria 2024
  11. 11 Materials science and crystal structures

     Materials science: explore and develop new materials with useful properties and functionalities, such as superconductors and battery materials. Crystal structure: • The "source code" of a material. • An infinitely repeating, periodic arrangement of atoms in 3D space. • Described by a minimum repeatable pattern called the unit cell. Example: the crystal structure of NaCl and its unit cell.
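To make the unit-cell description concrete, here is a minimal sketch (my own illustration, not from the talk; the two-atom cubic cell and its numbers are purely illustrative) of how a crystal can be stored as a lattice matrix, fractional coordinates, and species, and how the infinite structure arises by repeating the cell over lattice translations:

```python
import numpy as np

# Minimal sketch (not from the talk): a crystal as (lattice, fractional coordinates, species).
# Toy two-atom cubic cell; all numbers are illustrative only.
lattice = 4.0 * np.eye(3)                        # unit-cell vectors as rows (angstrom)
frac_coords = np.array([[0.0, 0.0, 0.0],
                        [0.5, 0.5, 0.5]])        # fractional coordinates inside the cell
species = ["Na", "Cl"]                           # atomic species labels (illustrative)

cart_coords = frac_coords @ lattice              # Cartesian positions of the unit-cell atoms

def periodic_images(cart_coords, lattice, n_max=1):
    """Positions of all atoms within lattice shifts |n_i| <= n_max (the real crystal is n_max -> inf)."""
    shifts = np.array([[i, j, k] for i in range(-n_max, n_max + 1)
                                 for j in range(-n_max, n_max + 1)
                                 for k in range(-n_max, n_max + 1)])
    return (cart_coords[None, :, :] + (shifts @ lattice)[:, None, :]).reshape(-1, 3)

print(periodic_images(cart_coords, lattice).shape)   # (54, 3): 27 shifts x 2 unit-cell atoms
```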
  12. 12 Material property prediction

     Crystal structure (unit cell) → neural network → material properties (formation energy, total energy, bandgap, energy above hull, etc.). • High-throughput alternative to physics simulation. • Material screening to accelerate material discovery and development. • Interfaces to multimodal foundation models via crystal encoders.
  13. 13 Interatomic message passing layers

     Periodic SE(3)-invariant prediction: a crystal structure is processed by interatomic message passing layers to predict material properties (formation energy, total energy, bandgap, energy above hull, etc.). • Evolve the state feature of each unit-cell atom via interatomic interactions. • For property prediction, networks need to be invariant under periodic SE(3) transformations of atomic positions (rotation, translation, and periodic boundary shift); see the small check below. • While not our focus, force prediction requires networks to be SE(3) equivariant.
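To illustrate the invariance requirement, the following small check (my own sketch; the lattice, atom count, cutoff, and shift range are arbitrary choices) verifies that interatomic distances, computed over periodic images, are unchanged under rotation, translation, and a periodic boundary shift, which is why distance-based encoders are naturally periodic-SE(3) invariant for property prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_distances(lattice, frac, cutoff=3.0, n_max=3):
    """Sorted interatomic distances below `cutoff`, counting periodic images up to n_max cell shifts."""
    shifts = np.array([[i, j, k] for i in range(-n_max, n_max + 1)
                                 for j in range(-n_max, n_max + 1)
                                 for k in range(-n_max, n_max + 1)])
    cart = frac @ lattice                                              # (N, 3)
    images = (cart[None, :, :] + (shifts @ lattice)[:, None, :]).reshape(-1, 3)
    d = np.linalg.norm(cart[:, None, :] - images[None, :, :], axis=-1)
    return np.sort(d[(d > 1e-8) & (d < cutoff)])

lattice = np.array([[4.0, 0.0, 0.0], [0.4, 5.0, 0.0], [0.3, 0.2, 6.0]])  # cell vectors as rows
frac = rng.uniform(size=(4, 3))                                          # 4 atoms in the cell

d_ref = local_distances(lattice, frac)

# Rotation + translation: rotate the cell vectors (hence all positions) and shift the origin.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))                             # random orthogonal matrix
t = rng.normal(size=3)
d_rot = local_distances(lattice @ Q.T, frac + t @ np.linalg.inv(lattice @ Q.T))
print(np.allclose(d_ref, d_rot))   # True: distances are SE(3) invariant

# Periodic boundary shift: move the cell origin and wrap fractional coordinates back into the cell.
d_pbc = local_distances(lattice, (frac + rng.uniform(size=3)) % 1.0)
print(np.allclose(d_ref, d_pbc))   # True: distances are invariant to the unit-cell choice
```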
  14. 14 Advances in ML and in material representation learning

     Advances in ML: CNNs (2011-): ResNet [He+ 2015]; GNNs (2015-): PointNet [Qi+ 2016], DeepSets [Zaheer+ 2017], GCN [Kipf & Welling 2017], GIN [Xu+ 2018]; Transformers (2017-): Transformer [Vaswani+ 2017], BERT [Devlin+ 2018], image generation [Parmar+ 2018], ViT [Dosovitskiy+ 2020].
     Molecules (3D arrangements of finite atoms): many geometric GNNs (Duvenaud+ 2015; Kearnes+ 2016; Gilmer+ 2017); success of Transformers: Graphormer [Ying+ 2021], Equiformer [Liao+ 2023].
     Crystals (3D arrangements of infinite atoms): many geometric GNNs: CGCNN [Xie & Grossman, 2018], SchNet [Schütt+ 2018], MEGNet [Chen+ 2019]; emergence of Transformers: Matformer [Yan+ 2022].
     Graphormer (2021) demonstrated the effectiveness of fully connected self-attention for molecules, but its applicability to infinitely periodic crystal structures remained an open question.
  15. 15 Atomic state evolution by self-attention

     Molecule: fully connected self-attention for finite elements,
     $y_i = \frac{1}{Z_i} \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij}) (v_j + \psi_{ij})$, with $Z_i = \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij})$.
     Relative position representations $\phi_{ij}$ and $\psi_{ij}$: • Encode the relative position $p_{ij} = p_j - p_i$ between atoms $i$ and $j$. • $\phi_{ij}$: scalar bias for the softmax logits. • $\psi_{ij}$: vector bias for the value features. • Distance-based representations (functions of $\|p_{ij}\|$) ensure SE(3) invariance.
     Atom-wise state features: • $q_i$, $k_j$, and $v_j$: linear projections of the input atom-wise state features. • $y_i$: output atom-wise state feature. (A minimal sketch of this attention follows below.)
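The following PyTorch-style sketch illustrates this finite-element attention with the two relative position biases (my own single-head illustration of the formulation above, not the authors' implementation; the Gaussian choice of φ in the toy usage is only an example of a distance-based bias):

```python
import torch
import torch.nn.functional as F

def relpos_self_attention(x, phi, psi, Wq, Wk, Wv):
    """Single-head self-attention with relative position biases.
    x:   (N, d)     atom-wise state features
    phi: (N, N)     scalar bias added to the softmax logits (e.g., a function of |p_j - p_i|)
    psi: (N, N, d)  vector bias added to the value features
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # (N, d) each
    d = q.shape[-1]
    logits = q @ k.T / d**0.5 + phi                        # (N, N)
    attn = F.softmax(logits, dim=-1)                       # normalize over atoms j
    y = attn @ v + torch.einsum('ij,ijd->id', attn, psi)   # sum_j a_ij (v_j + psi_ij)
    return y

# Toy usage: 3 atoms, feature dim 8, Gaussian distance-decay bias (SE(3) invariant).
N, d = 3, 8
x = torch.randn(N, d)
pos = torch.randn(N, 3)
phi = -torch.cdist(pos, pos)**2 / 2.0
psi = torch.zeros(N, N, d)
W = [torch.randn(d, d) / d**0.5 for _ in range(3)]
print(relpos_self_attention(x, phi, psi, *W).shape)        # torch.Size([3, 8])
```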
  16. 16 Atomic state evolution by self-attention

     Molecule → crystal structure: extend the fully connected self-attention for finite elements to infinitely connected self-attention for periodic elements (unit-cell atoms and all of their periodic images),
     $y_i = \frac{1}{Z_i} \sum_j \sum_{\bm{n}} \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij(\bm{n})}) (v_j + \psi_{ij(\bm{n})})$,
     where $\bm{n} = (n_1, n_2, n_3)$ ranges over integer lattice translations and $Z_i$ normalizes over all $j$ and $\bm{n}$. Here $\phi_{ij(\bm{n})}$ and $\psi_{ij(\bm{n})}$ encode the relative position $p_{ij(\bm{n})} = p_j + n_1 \ell_1 + n_2 \ell_2 + n_3 \ell_3 - p_i$ to reflect periodic unit-cell shifts $\bm{n}$.
  17. 18 Interpretation as neural potential summation

     With distance-decay attention, where $\exp(\phi_{ij(\bm{n})})$ decays as the interatomic distance grows, the infinitely connected attention can be interpreted as interatomic energy calculations in an abstract feature space: • $\exp(\phi_{ij(\bm{n})})$: abstract interatomic potential between atoms $i$ and $j(\bm{n})$, as a function of their distance $r$. • $\exp(q_i^\top k_j / \sqrt{d_K})(v_j + \psi_{ij(\bm{n})})$: abstract influences on atom $i$ from atom $j(\bm{n})$. Analogy to potential summation in physics simulations: for example, the electric potential energy between one particle and many particles with electric charges $q_i$ and $q_j$ is calculated as $J_i = \sum_j \frac{1}{4\pi\epsilon_0} \frac{q_i q_j}{\|r_j - r_i\|}$.
  18. 19 Periodic spatial encoding and periodic edge encoding

     The infinite series over lattice shifts $\bm{n}$ can be folded into pseudo-finite position encodings: $\alpha_{ij} = \log \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})})$ (periodic spatial encoding) and $\beta_{ij} = \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})}) \psi_{ij(\bm{n})} \,/\, \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})})$ (periodic edge encoding). The infinitely connected attention can then be performed just like standard self-attention for finite elements, $y_i = \frac{1}{Z_i} \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \alpha_{ij}) (v_j + \beta_{ij})$, with the new position encodings $\alpha_{ij}$ and $\beta_{ij}$ (see the sketch below).
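The sketch below illustrates how the lattice sum can be folded into α_ij under a Gaussian distance-decay choice of φ_ij(n) (my own illustration; the truncation range n_max, the fixed σ, and the commented-out β computation are simplifications rather than the paper's exact implementation):

```python
import torch

def periodic_encodings(pos, lattice, sigma=1.0, n_max=2):
    """Fold the lattice sum into alpha_ij = log sum_n exp(phi_ij(n)).
    pos:     (N, 3) Cartesian positions of unit-cell atoms
    lattice: (3, 3) unit-cell vectors as rows
    """
    r = torch.arange(-n_max, n_max + 1, dtype=pos.dtype)
    shifts = torch.cartesian_prod(r, r, r) @ lattice            # (S, 3) lattice translations
    # p_ij(n) = p_j + shift_n - p_i  ->  phi_ij(n) = -|p_ij(n)|^2 / (2 sigma^2)
    p = pos[None, :, None, :] + shifts[None, None, :, :] - pos[:, None, None, :]   # (N, N, S, 3)
    phi = -(p**2).sum(-1) / (2 * sigma**2)                      # (N, N, S)
    alpha = torch.logsumexp(phi, dim=-1)                        # (N, N)
    # beta_ij would be the softmax(phi)-weighted average of edge features psi_ij(n), e.g.:
    # beta = (torch.softmax(phi, dim=-1)[..., None] * psi).sum(dim=-2)
    return alpha

# alpha then simply replaces the finite-case scalar bias in standard self-attention:
# logits = q @ k.T / sqrt(d) + alpha
pos = torch.rand(4, 3) * 3.0
lattice = 3.0 * torch.eye(3)
print(periodic_encodings(pos, lattice).shape)   # torch.Size([4, 4])
```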
  19. 20 Evaluations on the Materials Project dataset

                                        Form. E.          Bandgap           Bulk mod.     Shear mod.
     Train/Val/Test                     60000/5000/4239   60000/5000/4239   4664/393/393  4664/392/393
     MAE unit                           eV/atom           eV                log(GPa)      log(GPa)
     CGCNN [Xie & Grossman, 2018]       0.031             0.292             0.047         0.077
     SchNet [Schütt+, 2018]             0.033             0.345             0.066         0.099
     MEGNet [Chen+, 2019]               0.03              0.307             0.06          0.099
     GATGNN [Louis+, 2020]              0.033             0.28              0.045         0.075
     M3GNet [Chen & Ong, 2022]          0.024             0.247             0.05          0.087
     ALIGNN [Choudhary & DeCost, 2021]  0.022             0.218             0.051         0.078
     Matformer [Yan+, 2022]             0.021             0.211             0.043         0.073
     PotNet [Lin+, 2023]                0.0188            0.204             0.04          0.065
     Crystalformer                      0.0198            0.201             0.0399        0.0692
     Crystalformer consistently outperforms most of the existing methods in these property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].
  20. 21 Evaluations on the JARVIS-DFT 3D 2021 dataset

                                        Form. E.         Total E.         Bandgap (OPT)    Bandgap (MBJ)    E hull
     Train/Val/Test                     44578/5572/5572  44578/5572/5572  44578/5572/5572  14537/1817/1817  44296/5537/5537
     MAE unit                           eV/atom          eV/atom          eV               eV               eV
     CGCNN [Xie & Grossman, 2018]       0.063            0.078            0.2              0.41             0.17
     SchNet [Schütt+, 2018]             0.045            0.047            0.19             0.43             0.14
     MEGNet [Chen+, 2019]               0.047            0.058            0.145            0.34             0.084
     GATGNN [Louis+, 2020]              0.047            0.056            0.17             0.51             0.12
     M3GNet [Chen & Ong, 2022]          0.039            0.041            0.145            0.362            0.095
     ALIGNN [Choudhary & DeCost, 2021]  0.0331           0.037            0.142            0.31             0.076
     Matformer [Yan+, 2022]             0.0325           0.035            0.137            0.3              0.064
     PotNet [Lin+, 2023]                0.0294           0.032            0.127            0.27             0.055
     Crystalformer                      0.0319           0.0342           0.131            0.275            0.0482
     Crystalformer consistently outperforms most of the existing methods in these property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].
  21. 22 Model efficiency comparison

     • Our model achieves higher efficiency than SOTA methods, such as PotNet (GNN-based) and Matformer (transformer-based). • Our architecture remains simple and closely follows the original transformer encoder, unlike Matformer, which involves many architectural modifications.
                             Arch. type   Train/Epoch  Total train  Test/Mater.  # Params  # Params/Block
     PotNet [Lin+, 2023]     GNN          43 s         5.9 h        313 ms       1.8 M     527 K
     Matformer [Yan+, 2022]  Transformer  60 s         8.3 h        20.4 ms      2.9 M     544 K
     Crystalformer           Transformer  32 s         7.2 h        6.6 ms       853 K     206 K
     (Architecture diagram: stacked self-attention blocks, each with multi-head attention, concat, linear, and feed-forward layers, followed by pooling and a feed-forward head.) Train and test times are evaluated on the JARVIS-DFT 3D (formation energy) dataset.
  22. 23 Fourier-space attention for long-range interactions

     Long-tail Gaussian potentials with a large σ in real space decay slowly with increasing |n|, so the real-space lattice sum needs many terms. Under a spatial Fourier transform, they become short-tail Gaussians in Fourier (reciprocal) space, enabling long-range interatomic interactions via self-attention and yielding a large improvement over the SOTA value (0.055) on energy above hull.
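The intuition can be checked numerically with the standard Poisson summation identity for a 1D Gaussian (my own toy example, not the paper's implementation): the periodic sum of a wide Gaussian needs many real-space terms, while its reciprocal-space series converges with only a few.

```python
import numpy as np

# Poisson summation: sum_n f(x + nL) = (1/L) sum_m fhat(2*pi*m/L) * exp(i*2*pi*m*x/L),
# where f(x) = exp(-x^2 / (2 sigma^2)) has Fourier transform fhat(w) = sigma*sqrt(2 pi)*exp(-sigma^2 w^2 / 2).
sigma, L, x = 3.0, 1.0, 0.3            # Gaussian that is wide relative to the period L

def real_space_sum(n_terms):
    n = np.arange(-n_terms, n_terms + 1)
    return np.sum(np.exp(-(x + n * L) ** 2 / (2 * sigma**2)))

def reciprocal_space_sum(m_terms):
    m = np.arange(-m_terms, m_terms + 1)
    fhat = sigma * np.sqrt(2 * np.pi) * np.exp(-(sigma**2) * (2 * np.pi * m / L) ** 2 / 2)
    return np.real(np.sum(fhat * np.exp(1j * 2 * np.pi * m * x / L))) / L

print(real_space_sum(2), real_space_sum(50))   # the real-space sum needs tens of terms to converge
print(reciprocal_space_sum(1))                 # the reciprocal-space sum is already converged at |m| <= 1
```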
  23. 24 Rethinking the role of frames for SE(3)-invariant crystal structure

    modeling Tatsunori Taniai* OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yusei Ito* OMRON SINIC X Intern, Osaka University (D1) Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Thirteenth International Conference on Learning Representations April 24th through 28th, 2025 at Singapore EXPO Singapore 2025
  24. 25 Atomic state evolution by self-attention

     Infinitely connected self-attention for crystals (Crystalformer [Taniai+, ICLR 24]) uses two types of position encodings: the distance-decay attention bias φ and the edge encoding ψ, a linear projection of radial basis functions (RBF) that encode a distance into a soft one-hot vector (see the sketch below). Such distance-based models ensure invariance under SE(3) transformations (rotation and translation) but have limited expressive power.
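Here is a minimal sketch of such an RBF edge encoding (my own illustration; the number of bases, range, width, and output dimension are arbitrary placeholders): a scalar distance is expanded into a soft one-hot vector over Gaussian centers and then linearly projected into an edge feature.

```python
import torch

def gaussian_rbf(dist, n_basis=64, r_max=8.0, gamma=10.0):
    """Encode distances (any shape) into soft one-hot vectors over n_basis Gaussian centers."""
    centers = torch.linspace(0.0, r_max, n_basis)            # (n_basis,)
    return torch.exp(-gamma * (dist[..., None] - centers) ** 2)

dist = torch.tensor([[0.0, 1.6], [1.6, 0.0]])                # (N, N) interatomic distances
rbf = gaussian_rbf(dist)                                     # (N, N, 64) soft one-hot vectors
proj = torch.nn.Linear(64, 128)                              # linear projection to edge features
psi = proj(rbf)                                              # (N, N, 128) value-bias features
print(psi.shape)
```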
  25. 26 Enhancing model expressivity under rotation invariance

     • Invariant features: use distances between pairs (limited expressivity) or angles between triplets (requires modeling many combinations of 3-body interactions). • Frames (our focus): standardize the orientations by finding structure-aligned coordinate systems (e.g., using PCA), giving a canonical representation with frame axes e1, e2, e3; ☺ no restriction on the architectural design. • Equivariant features: use spherical tensors in SO(3)-equivariant nets; restricted nonlinearity, heavy computation with limited angular resolution, and mathematically difficult. (Image from torch-harmonics [Bonev+, 2023].)
  26. 27 Challenges and questions in frame-based crystal modeling

     • Crystals are infinite: how can we define a standard orientation for such structures? • Unit cells are artificial: apparently different slices can represent the same crystal; should a canonical representation rely on such arbitrary representations? • What are frames for? There are many possible ways to construct frames; what makes a good frame, and is orientation normalization alone sufficient?
  27. 28 Rethinking the role of frames

     Frames are ultimately used in GNNs' message passing layers to derive richer yet invariant information than distances for the message function. A message passing layer updates each atom as $x_i \leftarrow \sum_j w_{ij}\, m_{i \leftarrow j}$, where • $r_{ij} = p_j - p_i$: relative position, • $m_{i \leftarrow j}$: message from atom $j$ to atom $i$, • $w_{ij}$: scalar weight. Feeding the raw relative position $r_{ij}$ into the message is not rotation invariant; applying a frame transformation $F_i = [e_1\; e_2\; e_3]^\top$ (e.g., eigenvectors of PCA) and feeding $F_i r_{ij}$ instead is rotation invariant.
  28. 29 Dynamic frames

     Let's dynamically construct a frame $F_i$ for each target atom $i$ and each layer, such that it normalizes the orientation of the local structure represented by the attention weights. In a message passing layer, • self-attention weights show which atoms actively interact with the target atom $i$; • use them as a mask on the structure, and align the frame axes $e_1, e_2$ (and $e_3$) with the masked structure as viewed from atom $i$.
  29. 30 Dynamic frames: some analogies

     • Rotation-invariant local features in computer vision, which normalize the orientation of a local patch before description (image from https://www.vlfeat.org/overview/sift.html). • Normalization layers, where Batch Norm or Layer Norm normalizes features before a linear layer. In the same spirit, dynamic frames are expected to better normalize the structural information before passing it to the message function.
  30. 31 Dynamic frames: definitions

     Weighted PCA frames • Compute a weighted covariance matrix for each target atom $i$, using the attention weights: $\Sigma_i = \sum_j \alpha_{ij}\, r_{ij} r_{ij}^\top$. • Compute orthonormal eigenvectors $e_1, e_2, e_3$ of $\Sigma_i$, corresponding to eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3$, as the frame axes.
     Max frames (weight-prioritized point selection with orthogonalization) • Select the relative position $r_1$ with the maximum weight and set $e_1 \leftarrow \bar{r}_1$ (its normalized direction). • Compute $e_2$ similarly while ensuring orthogonality (i.e., $e_1 \cdot e_2 = 0$). • Set $e_3 \leftarrow e_1 \times e_2$.
     To ensure SE(3) invariance, we constrain $F_i$ to be a rotation matrix. (A construction sketch follows below.)
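A minimal numpy sketch of the two constructions above (my own reading of the definitions; the weight choice, tie-breaking, degenerate cases, and eigenvector sign conventions are simplified), returning the frame axes as the rows of F_i:

```python
import numpy as np

def weighted_pca_frame(w, r):
    """Weighted PCA frame: eigenvectors of sum_j w_j r_j r_j^T, sorted by eigenvalue (descending).
    w: (M,) nonnegative weights (e.g., attention weights); r: (M, 3) relative positions."""
    cov = (w[:, None, None] * r[:, :, None] * r[:, None, :]).sum(axis=0)    # (3, 3)
    eigvals, eigvecs = np.linalg.eigh(cov)                                   # ascending eigenvalues
    F = eigvecs[:, ::-1].T                                                   # rows e1, e2, e3
    if np.linalg.det(F) < 0:                                                 # enforce a proper rotation
        F[2] = -F[2]
    return F

def max_frame(w, r, eps=1e-8):
    """Max frame: weight-prioritized point selection with orthogonalization."""
    order = np.argsort(-w)
    e1 = r[order[0]] / (np.linalg.norm(r[order[0]]) + eps)
    for idx in order[1:]:                                                    # next heaviest vectors
        v = r[idx] - (r[idx] @ e1) * e1                                      # remove the e1 component
        if np.linalg.norm(v) > eps:                                          # skip (anti)parallel vectors
            e2 = v / np.linalg.norm(v)
            break
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3])                                            # rows e1, e2, e3

w = np.array([0.5, 0.3, 0.2])
r = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.0, 0.1, 1.0]])
print(weighted_pca_frame(w, r) @ weighted_pca_frame(w, r).T)   # ~identity: orthonormal axes
print(np.linalg.det(max_frame(w, r)))                          # ~1: a proper rotation matrix
```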
  31. 32 CrystalFramer: Crystalformer + dynamic frames

     Extend the distance-based edge feature term of Crystalformer by incorporating 3D direction vectors via dynamic frames: each edge combines its distance with the frame-projected 3D direction vector, yielding direction-aware yet invariant edge features (see the sketch below).
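As a small follow-up sketch (again my own illustration, not the released code), an invariant edge feature can concatenate an RBF-encoded distance with the frame-projected unit direction; rotating the whole structure rotates both r_ij and the frame, so the feature is unchanged:

```python
import numpy as np

def invariant_edge_feature(F, r_ij, n_basis=16, r_max=8.0, gamma=4.0):
    """Concatenate an RBF-encoded distance with the frame-projected unit direction."""
    d = np.linalg.norm(r_ij)
    centers = np.linspace(0.0, r_max, n_basis)
    rbf = np.exp(-gamma * (d - centers) ** 2)            # distance part (already invariant)
    direction = F @ (r_ij / (d + 1e-8))                  # direction expressed in the local frame
    return np.concatenate([rbf, direction])              # (n_basis + 3,)

F = np.eye(3)                                            # e.g., a dynamic frame from the previous sketch
print(invariant_edge_feature(F, np.array([1.0, 2.0, 2.0])).shape)   # (19,)
```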
  32. 33 Evaluations on the JARVIS-DFT 3D 2021 dataset

     Comparisons between dynamic frames and their static counterparts (weighted PCA vs PCA; max vs static local) show that dynamic frames outperform conventional static frames.
                                               E form.  E total.  BG (OPT)  BG (MBJ)  E hull
     Matformer (Yan et al., 2022)              0.0325   0.035     0.137     0.30      0.064
     PotNet (Lin et al., 2023)                 0.0294   0.032     0.127     0.27      0.055
     eComFormer (Yan et al., 2024)             0.0284   0.032     0.124     0.28      0.044
     iComFormer (Yan et al., 2024)             0.0272   0.0288    0.122     0.26      0.047
     Crystalformer (Taniai et al., 2024)       0.0306   0.0320    0.128     0.274     0.0463
     ─ w/ PCA frames (Duval et al., 2023)      0.0325   0.0334    0.144     0.292     0.0568
     ─ w/ lattice frames (Yan et al., 2024)    0.0302   0.0323    0.125     0.274     0.0531
     ─ w/ static local frames                  0.0285   0.0292    0.122     0.261     0.0444
     ─ w/ weighted PCA frames (proposed)       0.0287   0.0305    0.126     0.279     0.0444
     ─ w/ max frames (proposed)                0.0263   0.0279    0.117     0.242     0.0471
  33. 35 Neural structure fields (NeSF) for material decoding

     • Unlike point clouds of 3D surfaces in CV, decoding atomic systems is challenging due to their unknown and variable number of atoms. • We propose to represent point-based structures as continuous vector fields: given the latent code of a structure and an arbitrarily chosen 3D query point, the field outputs information about the nearest atom, namely a 3D displacement vector to it and a categorical distribution over atomic species (e.g., H, He, ...). Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).
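A minimal sketch of the field decoder's interface (my own illustration of the idea; the MLP sizes, latent dimension, and species count are hypothetical, not the authors' architecture): the decoder maps a structure latent and a 3D query point to a displacement toward the nearest atom and a distribution over species.

```python
import torch
import torch.nn as nn

class StructureFieldDecoder(nn.Module):
    """Map (structure latent z, 3D query point) -> (displacement to nearest atom, species logits)."""
    def __init__(self, latent_dim=128, hidden=256, n_species=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.displacement = nn.Linear(hidden, 3)          # vector pointing to the nearest atom
        self.species = nn.Linear(hidden, n_species)       # categorical logits (H, He, ...)

    def forward(self, z, query):
        h = self.mlp(torch.cat([z, query], dim=-1))
        return self.displacement(h), self.species(h)

decoder = StructureFieldDecoder()
z = torch.randn(32, 128)                                  # latents for a batch of structures
query = torch.rand(32, 3)                                 # arbitrary 3D query points
vec, logits = decoder(z, query)
print(vec.shape, logits.shape)                            # [32, 3], [32, 100]
```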
  34. 36 Field-based crystal autoencoders

     Field-based decoding achieves better reconstructions than conventional voxel-based decoding: voxel-based decoding often fails to reconstruct some atoms (producing too many or too few atoms), while the field-based decoder recovers the input structures more faithfully. Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).
  35. 37 CLaSP: CLIP-like multimodal learning for materials science

     • CV has fostered large-scale datasets of images with textual annotations (e.g., ImageNet, MS-COCO), enabling multimodal learning between text and images (CLIP, 2021). • Materials science lacks such resources, mainly due to the difficulty of crowdsourcing annotations. • Instead, we leverage a public database of 400k materials with publication metadata (titles and abstracts) to enable contrastive learning between text and structure (see the sketch below). Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
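For reference, a minimal sketch of the CLIP-style contrastive objective in this setting (a generic symmetric InfoNCE loss over paired crystal and text embeddings; the batch size, temperature, and placeholder embeddings are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(crystal_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (crystal, text) embeddings."""
    c = F.normalize(crystal_emb, dim=-1)                  # (B, D)
    t = F.normalize(text_emb, dim=-1)                     # (B, D)
    logits = c @ t.T / temperature                        # (B, B) similarity matrix
    targets = torch.arange(c.shape[0])                    # the i-th crystal pairs with the i-th text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with placeholder embeddings, e.g., from a crystal encoder and a text encoder.
loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```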
  36. 38 Application: text-based retrieval of crystal structures

     Key finding: literature-driven learning enables models to predict high-level functionalities of crystal structures. Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
  37. 39 Application: visualization of materials space

     In the resulting embedding space, structures with similar properties automatically form clusters (t-SNE visualization). Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
  38. 40 Summary of this talk Crystalformer [Taniai+, ICLR 2024] –

    A natural extension of standard transformers for periodic crystal structures – Distance-decay attention abstractly mimics energy calculations in physics – Fourier-space attention captures long-range interatomic interactions CrystalFramer [Ito & Taniai+, ICLR 2025] – Introduces dynamic frames derived from attention mechanisms to enhance expressive power Applications – Crystal encoders are immediately applicable to high-throughput property prediction – Serve as core components in embedding learning and multimodal foundation models Future directions – Extend to equivariant networks for structure and force-field prediction – Such networks are essential for generative modeling (e.g., diffusion models)