Geometric Modeling of Crystal Structures using Transformers

Materials are everywhere around us—serving as semiconductors in computers, souvenir magnets on refrigerators, and lithium-ion battery components in smartphones. At the atomic level, these materials are made of atoms arranged in three-dimensional space with a remarkable property: periodicity.
For decades, materials scientists have explored how such structures relate to material properties, using both experiments and simulations. Today, machine learning is bringing new perspectives to this challenge.
In this talk, I’ll introduce our recent work presented at ICLR 2024 and 2025 on transformer-based geometric encoders for crystal structures. These models aim to predict material properties by learning from the 3D periodic geometry of crystals. I’ll also touch on broader applications of these crystal encoders beyond property prediction, as explored in our group.

Tatsunori Taniai

June 05, 2025
Transcript

  1. 1 Geometric Modeling of Crystal Structures using Transformers Tatsunori Taniai

    Senior Researcher OMRON SINIC X Corporation Seminar at RIKEN Center for AIP June 5th, 2025, RIKEN AIP Nihonbashi Office 15:00 – 16:00
  2. 2 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo

    (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X Stereo & flow & motion seg [CVPR 17] Semantic correspondence [CVPR 16] Stereo depth estimation [CVPR 14, TPAMI 18] Binary MRF optimization [BMVC 12, CVPR 15] My PhD thesis was about discrete optimization for low-level computer vision tasks such as stereo and segmentation, without any “learning”
  3. 3 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo

    (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X In the deep learning era, I have been seeking methodologies for integrating physical or algorithmic principles into deep learning-based methods. Physics-based self-supervised learning [Taniai & Maehara, ICML 18] Neural A* for learning path planning problems [Yonetani & Taniai+, ICML 21] Transformer encoders for crystals [Taniai+, ICLR 24] [Ito & Taniai+, ICLR 25]
  4. 4 Substances as 3D point clouds of atoms

     Substances are atoms forming stable structures in 3D space. • Molecules: 3D structures of up to 100s of atoms. • Proteins: huge molecules with 1k to 10k atoms, coded as 1D amino-acid sequences. • Crystals: an infinite number of atoms with periodicity; the focus of this talk.
  5. 5 Transformer encoders for understanding contexts

     Example: "He runs a company." In isolation, the tokens are only word classes (Pronoun, Verb or noun, Article, Noun, Period), and "runs" is ambiguous (jog or manage?). Through self-attention with 1D sequential positions, the tokens resolve into context-dependent roles: Subject, Verb (manage), Article, Object (a business organization), EOS. Self-attention can estimate context-dependent meanings of words in text.
  6. 6 Transformer encoders for understanding contexts

     Example: "He runs a dog." Starting from the same ambiguous word-class tokens, self-attention with 1D sequential positions now resolves "runs" to a verb meaning train/exercise and "dog" to an object (a pet animal). Self-attention can estimate context-dependent meanings of words in text.
  7. 7 Transformer encoders for analyzing substance structures

     Structure of the H2O molecule: two hydrogen atoms bonded to one oxygen atom. [3D model "H2O Molecule" (https://skfb.ly/6QWvZ) by Mehdi Mirzaie, licensed under Creative Commons Attribution 4.0 (http://creativecommons.org/licenses/by/4.0/).]
  8. 8 Transformer encoders for analyzing substance structures

     Use self-attention to evolve atomic states in a given spatial configuration: each atomic token starts as an abstract state of H or O in H2O, and self-attention with 3D spatial positions evolves these tokens into task-related states of H and O in H2O.
  9. 9 Geometric deep learning for materials science

     • Property prediction: high-throughput screening of materials; basic benchmark tasks for material encoders; needs invariant encoders. • Structure prediction: generate novel structures (e.g., inverse design); find stabler structures; predict chemical reactions; needs equivariant decoders. • Foundation models: predict high-level functionalities of materials; map the materials space; material encoders as an interface to multimodal FMs. Today's main topic is property prediction; our recent results on the other directions are briefly introduced at the end.
  10. 10 Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding Tatsunori

    Taniai OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yuta Suzuki Toyota Motor Corporation Naoya Chiba Tohoku University Kotaro Saito Randeft Inc. Osaka University Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Twelfth International Conference on Learning Representations May 7th through 11th, 2024 at Messe Wien Exhibition and Congress Center Vienna, Austria 2024
  11. 11 Materials science and crystal structures

     Materials science: explore and develop new materials with useful properties and functionalities, such as superconductors and battery materials. Crystal structure: • The "source code" of a material. • An infinitely repeating, periodic arrangement of atoms in 3D space. • Described by a minimum repeatable pattern called the unit cell. Example: the crystal structure of NaCl and its unit cell.
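To make the unit-cell description concrete, here is a minimal sketch (my own illustration, not from the talk; the two-atom cubic cell and its numbers are purely illustrative) of how a crystal can be stored as a lattice matrix, fractional coordinates, and species, and how the infinite structure arises by repeating the cell over lattice translations:

```python
import numpy as np

# Minimal sketch (not from the talk): a crystal as (lattice, fractional coordinates, species).
# Toy two-atom cubic cell; all numbers are illustrative only.
lattice = 4.0 * np.eye(3)                        # unit-cell vectors as rows (angstrom)
frac_coords = np.array([[0.0, 0.0, 0.0],
                        [0.5, 0.5, 0.5]])        # fractional coordinates inside the cell
species = ["Na", "Cl"]                           # atomic species labels (illustrative)

cart_coords = frac_coords @ lattice              # Cartesian positions of the unit-cell atoms

def periodic_images(cart_coords, lattice, n_max=1):
    """Positions of all atoms within lattice shifts |n_i| <= n_max (the real crystal is n_max -> inf)."""
    shifts = np.array([[i, j, k] for i in range(-n_max, n_max + 1)
                                 for j in range(-n_max, n_max + 1)
                                 for k in range(-n_max, n_max + 1)])
    return (cart_coords[None, :, :] + (shifts @ lattice)[:, None, :]).reshape(-1, 3)

print(periodic_images(cart_coords, lattice).shape)   # (54, 3): 27 shifts x 2 unit-cell atoms
```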
  12. 12 Material property prediction

     Crystal structure (unit cell) → neural network → material properties (formation energy, total energy, bandgap, energy above hull, etc.). • High-throughput alternative to physics simulation. • Material screening to accelerate material discovery and development. • Interfaces to multimodal foundation models via crystal encoders.
  13. 13 Interatomic message passing layers

     Periodic SE(3)-invariant prediction: a crystal structure is processed by interatomic message passing layers to predict material properties (formation energy, total energy, bandgap, energy above hull, etc.). • Evolve the state feature of each unit-cell atom via interatomic interactions. • For property prediction, networks need to be invariant under periodic SE(3) transformations of atomic positions (rotation, translation, and periodic boundary shift); see the small check below. • While not our focus, force prediction requires networks to be SE(3) equivariant.
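To illustrate the invariance requirement, the following small check (my own sketch; the lattice, atom count, cutoff, and shift range are arbitrary choices) verifies that interatomic distances, computed over periodic images, are unchanged under rotation, translation, and a periodic boundary shift, which is why distance-based encoders are naturally periodic-SE(3) invariant for property prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_distances(lattice, frac, cutoff=3.0, n_max=3):
    """Sorted interatomic distances below `cutoff`, counting periodic images up to n_max cell shifts."""
    shifts = np.array([[i, j, k] for i in range(-n_max, n_max + 1)
                                 for j in range(-n_max, n_max + 1)
                                 for k in range(-n_max, n_max + 1)])
    cart = frac @ lattice                                              # (N, 3)
    images = (cart[None, :, :] + (shifts @ lattice)[:, None, :]).reshape(-1, 3)
    d = np.linalg.norm(cart[:, None, :] - images[None, :, :], axis=-1)
    return np.sort(d[(d > 1e-8) & (d < cutoff)])

lattice = np.array([[4.0, 0.0, 0.0], [0.4, 5.0, 0.0], [0.3, 0.2, 6.0]])  # cell vectors as rows
frac = rng.uniform(size=(4, 3))                                          # 4 atoms in the cell

d_ref = local_distances(lattice, frac)

# Rotation + translation: rotate the cell vectors (hence all positions) and shift the origin.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))                             # random orthogonal matrix
t = rng.normal(size=3)
d_rot = local_distances(lattice @ Q.T, frac + t @ np.linalg.inv(lattice @ Q.T))
print(np.allclose(d_ref, d_rot))   # True: distances are SE(3) invariant

# Periodic boundary shift: move the cell origin and wrap fractional coordinates back into the cell.
d_pbc = local_distances(lattice, (frac + rng.uniform(size=3)) % 1.0)
print(np.allclose(d_ref, d_pbc))   # True: distances are invariant to the unit-cell choice
```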
  14. 14 Advances in ML and in material representation learning

     Advances in ML: CNNs (2011-): ResNet [He+ 2015]; GNNs (2015-): PointNet [Qi+ 2016], DeepSets [Zaheer+ 2017], GCN [Kipf & Welling 2017], GIN [Xu+ 2018]; Transformers (2017-): Transformer [Vaswani+ 2017], BERT [Devlin+ 2018], image generation [Parmar+ 2018], ViT [Dosovitskiy+ 2020].
     Molecules (3D arrangements of finite atoms): many geometric GNNs (Duvenaud+ 2015; Kearnes+ 2016; Gilmer+ 2017); success of Transformers: Graphormer [Ying+ 2021], Equiformer [Liao+ 2023].
     Crystals (3D arrangements of infinite atoms): many geometric GNNs: CGCNN [Xie & Grossman, 2018], SchNet [Schütt+ 2018], MEGNet [Chen+ 2019]; emergence of Transformers: Matformer [Yan+ 2022].
     Graphormer (2021) demonstrated the effectiveness of fully connected self-attention for molecules, but its applicability to infinitely periodic crystal structures remained an open question.
  15. 15 Atomic state evolution by self-attention

     Molecule: fully connected self-attention for finite elements,
     $y_i = \frac{1}{Z_i} \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij}) (v_j + \psi_{ij})$, with $Z_i = \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij})$.
     Relative position representations $\phi_{ij}$ and $\psi_{ij}$: • Encode the relative position $p_{ij} = p_j - p_i$ between atoms $i$ and $j$. • $\phi_{ij}$: scalar bias for the softmax logits. • $\psi_{ij}$: vector bias for the value features. • Distance-based representations (functions of $\|p_{ij}\|$) ensure SE(3) invariance.
     Atom-wise state features: • $q_i$, $k_j$, and $v_j$: linear projections of the input atom-wise state features. • $y_i$: output atom-wise state feature. (A minimal sketch of this attention follows below.)
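The following PyTorch-style sketch illustrates this finite-element attention with the two relative position biases (my own single-head illustration of the formulation above, not the authors' implementation; the Gaussian choice of φ in the toy usage is only an example of a distance-based bias):

```python
import torch
import torch.nn.functional as F

def relpos_self_attention(x, phi, psi, Wq, Wk, Wv):
    """Single-head self-attention with relative position biases.
    x:   (N, d)     atom-wise state features
    phi: (N, N)     scalar bias added to the softmax logits (e.g., a function of |p_j - p_i|)
    psi: (N, N, d)  vector bias added to the value features
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # (N, d) each
    d = q.shape[-1]
    logits = q @ k.T / d**0.5 + phi                        # (N, N)
    attn = F.softmax(logits, dim=-1)                       # normalize over atoms j
    y = attn @ v + torch.einsum('ij,ijd->id', attn, psi)   # sum_j a_ij (v_j + psi_ij)
    return y

# Toy usage: 3 atoms, feature dim 8, Gaussian distance-decay bias (SE(3) invariant).
N, d = 3, 8
x = torch.randn(N, d)
pos = torch.randn(N, 3)
phi = -torch.cdist(pos, pos)**2 / 2.0
psi = torch.zeros(N, N, d)
W = [torch.randn(d, d) / d**0.5 for _ in range(3)]
print(relpos_self_attention(x, phi, psi, *W).shape)        # torch.Size([3, 8])
```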
  16. 16 Atomic state evolution by self-attention

     Molecule → crystal structure: extend the fully connected self-attention for finite elements to infinitely connected self-attention for periodic elements (unit-cell atoms and all of their periodic images),
     $y_i = \frac{1}{Z_i} \sum_j \sum_{\bm{n}} \exp(q_i^\top k_j / \sqrt{d_K} + \phi_{ij(\bm{n})}) (v_j + \psi_{ij(\bm{n})})$,
     where $\bm{n} = (n_1, n_2, n_3)$ ranges over integer lattice translations and $Z_i$ normalizes over all $j$ and $\bm{n}$. Here $\phi_{ij(\bm{n})}$ and $\psi_{ij(\bm{n})}$ encode the relative position $p_{ij(\bm{n})} = p_j + n_1 \ell_1 + n_2 \ell_2 + n_3 \ell_3 - p_i$ to reflect periodic unit-cell shifts $\bm{n}$.
  17. 18 Interpretation as neural potential summation

     With distance-decay attention, where $\exp(\phi_{ij(\bm{n})})$ decays as the interatomic distance grows, the infinitely connected attention can be interpreted as interatomic energy calculations in an abstract feature space: • $\exp(\phi_{ij(\bm{n})})$: abstract interatomic potential between atoms $i$ and $j(\bm{n})$, as a function of their distance $r$. • $\exp(q_i^\top k_j / \sqrt{d_K})(v_j + \psi_{ij(\bm{n})})$: abstract influences on atom $i$ from atom $j(\bm{n})$. Analogy to potential summation in physics simulations: for example, the electric potential energy between one particle and many particles with electric charges $q_i$ and $q_j$ is calculated as $J_i = \sum_j \frac{1}{4\pi\epsilon_0} \frac{q_i q_j}{\|r_j - r_i\|}$.
  18. 19 Periodic spatial encoding and periodic edge encoding

     The infinite series over lattice shifts $\bm{n}$ can be folded into pseudo-finite position encodings: $\alpha_{ij} = \log \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})})$ (periodic spatial encoding) and $\beta_{ij} = \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})}) \psi_{ij(\bm{n})} \,/\, \sum_{\bm{n}} \exp(\phi_{ij(\bm{n})})$ (periodic edge encoding). The infinitely connected attention can then be performed just like standard self-attention for finite elements, $y_i = \frac{1}{Z_i} \sum_j \exp(q_i^\top k_j / \sqrt{d_K} + \alpha_{ij}) (v_j + \beta_{ij})$, with the new position encodings $\alpha_{ij}$ and $\beta_{ij}$ (see the sketch below).
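The sketch below illustrates how the lattice sum can be folded into α_ij under a Gaussian distance-decay choice of φ_ij(n) (my own illustration; the truncation range n_max, the fixed σ, and the commented-out β computation are simplifications rather than the paper's exact implementation):

```python
import torch

def periodic_encodings(pos, lattice, sigma=1.0, n_max=2):
    """Fold the lattice sum into alpha_ij = log sum_n exp(phi_ij(n)).
    pos:     (N, 3) Cartesian positions of unit-cell atoms
    lattice: (3, 3) unit-cell vectors as rows
    """
    r = torch.arange(-n_max, n_max + 1, dtype=pos.dtype)
    shifts = torch.cartesian_prod(r, r, r) @ lattice            # (S, 3) lattice translations
    # p_ij(n) = p_j + shift_n - p_i  ->  phi_ij(n) = -|p_ij(n)|^2 / (2 sigma^2)
    p = pos[None, :, None, :] + shifts[None, None, :, :] - pos[:, None, None, :]   # (N, N, S, 3)
    phi = -(p**2).sum(-1) / (2 * sigma**2)                      # (N, N, S)
    alpha = torch.logsumexp(phi, dim=-1)                        # (N, N)
    # beta_ij would be the softmax(phi)-weighted average of edge features psi_ij(n), e.g.:
    # beta = (torch.softmax(phi, dim=-1)[..., None] * psi).sum(dim=-2)
    return alpha

# alpha then simply replaces the finite-case scalar bias in standard self-attention:
# logits = q @ k.T / sqrt(d) + alpha
pos = torch.rand(4, 3) * 3.0
lattice = 3.0 * torch.eye(3)
print(periodic_encodings(pos, lattice).shape)   # torch.Size([4, 4])
```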
  19. 20 Evaluations on the Materials Project dataset

                                        Form. E.          Bandgap           Bulk mod.     Shear mod.
     Train/Val/Test                     60000/5000/4239   60000/5000/4239   4664/393/393  4664/392/393
     MAE unit                           eV/atom           eV                log(GPa)      log(GPa)
     CGCNN [Xie & Grossman, 2018]       0.031             0.292             0.047         0.077
     SchNet [Schütt+, 2018]             0.033             0.345             0.066         0.099
     MEGNet [Chen+, 2019]               0.03              0.307             0.06          0.099
     GATGNN [Louis+, 2020]              0.033             0.28              0.045         0.075
     M3GNet [Chen & Ong, 2022]          0.024             0.247             0.05          0.087
     ALIGNN [Choudhary & DeCost, 2021]  0.022             0.218             0.051         0.078
     Matformer [Yan+, 2022]             0.021             0.211             0.043         0.073
     PotNet [Lin+, 2023]                0.0188            0.204             0.04          0.065
     Crystalformer                      0.0198            0.201             0.0399        0.0692
     Crystalformer consistently outperforms most of the existing methods in these property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].
  20. 21 Evaluations on the JARVIS-DFT 3D 2021 dataset

                                        Form. E.         Total E.         Bandgap (OPT)    Bandgap (MBJ)    E hull
     Train/Val/Test                     44578/5572/5572  44578/5572/5572  44578/5572/5572  14537/1817/1817  44296/5537/5537
     MAE unit                           eV/atom          eV/atom          eV               eV               eV
     CGCNN [Xie & Grossman, 2018]       0.063            0.078            0.2              0.41             0.17
     SchNet [Schütt+, 2018]             0.045            0.047            0.19             0.43             0.14
     MEGNet [Chen+, 2019]               0.047            0.058            0.145            0.34             0.084
     GATGNN [Louis+, 2020]              0.047            0.056            0.17             0.51             0.12
     M3GNet [Chen & Ong, 2022]          0.039            0.041            0.145            0.362            0.095
     ALIGNN [Choudhary & DeCost, 2021]  0.0331           0.037            0.142            0.31             0.076
     Matformer [Yan+, 2022]             0.0325           0.035            0.137            0.3              0.064
     PotNet [Lin+, 2023]                0.0294           0.032            0.127            0.27             0.055
     Crystalformer                      0.0319           0.0342           0.131            0.275            0.0482
     Crystalformer consistently outperforms most of the existing methods in these property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].
  21. 22 Model efficiency comparison

     • Our model achieves higher efficiency than SOTA methods, such as PotNet (GNN-based) and Matformer (transformer-based). • Our architecture remains simple and closely follows the original transformer encoder, unlike Matformer, which involves many architectural modifications.
                             Arch. type   Train/Epoch  Total train  Test/Mater.  # Params  # Params/Block
     PotNet [Lin+, 2023]     GNN          43 s         5.9 h        313 ms       1.8 M     527 K
     Matformer [Yan+, 2022]  Transformer  60 s         8.3 h        20.4 ms      2.9 M     544 K
     Crystalformer           Transformer  32 s         7.2 h        6.6 ms       853 K     206 K
     (Architecture diagram: stacked self-attention blocks, each with multi-head attention, concat, linear, and feed-forward layers, followed by pooling and a feed-forward head.) Train and test times are evaluated on the JARVIS-DFT 3D (formation energy) dataset.
  22. 23 Fourier-space attention for long-range interactions

     Long-tail Gaussian potentials with a large σ in real space decay slowly with increasing |n|, so the real-space lattice sum needs many terms. Under a spatial Fourier transform, they become short-tail Gaussians in Fourier (reciprocal) space, enabling long-range interatomic interactions via self-attention and yielding a large improvement over the SOTA value (0.055) on energy above hull.
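The intuition can be checked numerically with the standard Poisson summation identity for a 1D Gaussian (my own toy example, not the paper's implementation): the periodic sum of a wide Gaussian needs many real-space terms, while its reciprocal-space series converges with only a few.

```python
import numpy as np

# Poisson summation: sum_n f(x + nL) = (1/L) sum_m fhat(2*pi*m/L) * exp(i*2*pi*m*x/L),
# where f(x) = exp(-x^2 / (2 sigma^2)) has Fourier transform fhat(w) = sigma*sqrt(2 pi)*exp(-sigma^2 w^2 / 2).
sigma, L, x = 3.0, 1.0, 0.3            # Gaussian that is wide relative to the period L

def real_space_sum(n_terms):
    n = np.arange(-n_terms, n_terms + 1)
    return np.sum(np.exp(-(x + n * L) ** 2 / (2 * sigma**2)))

def reciprocal_space_sum(m_terms):
    m = np.arange(-m_terms, m_terms + 1)
    fhat = sigma * np.sqrt(2 * np.pi) * np.exp(-(sigma**2) * (2 * np.pi * m / L) ** 2 / 2)
    return np.real(np.sum(fhat * np.exp(1j * 2 * np.pi * m * x / L))) / L

print(real_space_sum(2), real_space_sum(50))   # the real-space sum needs tens of terms to converge
print(reciprocal_space_sum(1))                 # the reciprocal-space sum is already converged at |m| <= 1
```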
  23. 24 Rethinking the role of frames for SE(3)-invariant crystal structure

    modeling Tatsunori Taniai* OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yusei Ito* OMRON SINIC X Intern, Osaka University (D1) Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Thirteenth International Conference on Learning Representations April 24th through 28th, 2025 at Singapore EXPO Singapore 2025
  24. 25 Atomic state evolution by self-attention

     Infinitely connected self-attention for crystals (Crystalformer [Taniai+, ICLR 24]) uses two types of position encodings: the distance-decay attention bias φ and the edge encoding ψ, a linear projection of radial basis functions (RBF) that encode a distance into a soft one-hot vector (see the sketch below). Such distance-based models ensure invariance under SE(3) transformations (rotation and translation) but have limited expressive power.
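Here is a minimal sketch of such an RBF edge encoding (my own illustration; the number of bases, range, width, and output dimension are arbitrary placeholders): a scalar distance is expanded into a soft one-hot vector over Gaussian centers and then linearly projected into an edge feature.

```python
import torch

def gaussian_rbf(dist, n_basis=64, r_max=8.0, gamma=10.0):
    """Encode distances (any shape) into soft one-hot vectors over n_basis Gaussian centers."""
    centers = torch.linspace(0.0, r_max, n_basis)            # (n_basis,)
    return torch.exp(-gamma * (dist[..., None] - centers) ** 2)

dist = torch.tensor([[0.0, 1.6], [1.6, 0.0]])                # (N, N) interatomic distances
rbf = gaussian_rbf(dist)                                     # (N, N, 64) soft one-hot vectors
proj = torch.nn.Linear(64, 128)                              # linear projection to edge features
psi = proj(rbf)                                              # (N, N, 128) value-bias features
print(psi.shape)
```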
  25. 26 Enhancing model expressivity under rotation invariance

     • Invariant features: use distances between pairs (limited expressivity) or angles between triplets (requires modeling many combinations of 3-body interactions). • Frames (our focus): standardize the orientations by finding structure-aligned coordinate systems (e.g., using PCA), giving a canonical representation with frame axes e1, e2, e3; ☺ no restriction on the architectural design. • Equivariant features: use spherical tensors in SO(3)-equivariant nets; restricted nonlinearity, heavy computation with limited angular resolution, and mathematically difficult. (Image from torch-harmonics [Bonev+, 2023].)
  26. 27 Challenges and questions in frame-based crystal modeling

     • Crystals are infinite: how can we define a standard orientation for such structures? • Unit cells are artificial: apparently different slices can represent the same crystal; should a canonical representation rely on such arbitrary representations? • What are frames for? There are many possible ways to construct frames; what makes a good frame, and is orientation normalization alone sufficient?
  27. 28 Rethinking the role of frames

     Frames are ultimately used in GNNs' message passing layers to derive richer yet invariant information than distances for the message function. A message passing layer updates each atom as $x_i \leftarrow \sum_j w_{ij}\, m_{i \leftarrow j}$, where • $r_{ij} = p_j - p_i$: relative position, • $m_{i \leftarrow j}$: message from atom $j$ to atom $i$, • $w_{ij}$: scalar weight. Feeding the raw relative position $r_{ij}$ into the message is not rotation invariant; applying a frame transformation $F_i = [e_1\; e_2\; e_3]^\top$ (e.g., eigenvectors of PCA) and feeding $F_i r_{ij}$ instead is rotation invariant.
  28. 29 Dynamic frames

     Let's dynamically construct a frame $F_i$ for each target atom $i$ and each layer, such that it normalizes the orientation of the local structure represented by the attention weights. In a message passing layer, • self-attention weights show which atoms actively interact with the target atom $i$; • use them as a mask on the structure, and align the frame axes $e_1, e_2$ (and $e_3$) with the masked structure as viewed from atom $i$.
  29. 30 Dynamic frames: some analogies

     • Rotation-invariant local features in computer vision, which normalize the orientation of a local patch before description (image from https://www.vlfeat.org/overview/sift.html). • Normalization layers, where Batch Norm or Layer Norm normalizes features before a linear layer. In the same spirit, dynamic frames are expected to better normalize the structural information before passing it to the message function.
  30. 31 Dynamic frames: definitions

     Weighted PCA frames • Compute a weighted covariance matrix for each target atom $i$, using the attention weights: $\Sigma_i = \sum_j \alpha_{ij}\, r_{ij} r_{ij}^\top$. • Compute orthonormal eigenvectors $e_1, e_2, e_3$ of $\Sigma_i$, corresponding to eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3$, as the frame axes.
     Max frames (weight-prioritized point selection with orthogonalization) • Select the relative position $r_1$ with the maximum weight and set $e_1 \leftarrow \bar{r}_1$ (its normalized direction). • Compute $e_2$ similarly while ensuring orthogonality (i.e., $e_1 \cdot e_2 = 0$). • Set $e_3 \leftarrow e_1 \times e_2$.
     To ensure SE(3) invariance, we constrain $F_i$ to be a rotation matrix. (A construction sketch follows below.)
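A minimal numpy sketch of the two constructions above (my own reading of the definitions; the weight choice, tie-breaking, degenerate cases, and eigenvector sign conventions are simplified), returning the frame axes as the rows of F_i:

```python
import numpy as np

def weighted_pca_frame(w, r):
    """Weighted PCA frame: eigenvectors of sum_j w_j r_j r_j^T, sorted by eigenvalue (descending).
    w: (M,) nonnegative weights (e.g., attention weights); r: (M, 3) relative positions."""
    cov = (w[:, None, None] * r[:, :, None] * r[:, None, :]).sum(axis=0)    # (3, 3)
    eigvals, eigvecs = np.linalg.eigh(cov)                                   # ascending eigenvalues
    F = eigvecs[:, ::-1].T                                                   # rows e1, e2, e3
    if np.linalg.det(F) < 0:                                                 # enforce a proper rotation
        F[2] = -F[2]
    return F

def max_frame(w, r, eps=1e-8):
    """Max frame: weight-prioritized point selection with orthogonalization."""
    order = np.argsort(-w)
    e1 = r[order[0]] / (np.linalg.norm(r[order[0]]) + eps)
    for idx in order[1:]:                                                    # next heaviest vectors
        v = r[idx] - (r[idx] @ e1) * e1                                      # remove the e1 component
        if np.linalg.norm(v) > eps:                                          # skip (anti)parallel vectors
            e2 = v / np.linalg.norm(v)
            break
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3])                                            # rows e1, e2, e3

w = np.array([0.5, 0.3, 0.2])
r = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.0, 0.1, 1.0]])
print(weighted_pca_frame(w, r) @ weighted_pca_frame(w, r).T)   # ~identity: orthonormal axes
print(np.linalg.det(max_frame(w, r)))                          # ~1: a proper rotation matrix
```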
  31. 32 CrystalFramer: Crystalformer + dynamic frames

     Extend the distance-based edge feature term of Crystalformer by incorporating 3D direction vectors via dynamic frames: each edge combines its distance with the frame-projected 3D direction vector, yielding direction-aware yet invariant edge features (see the sketch below).
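As a small follow-up sketch (again my own illustration, not the released code), an invariant edge feature can concatenate an RBF-encoded distance with the frame-projected unit direction; rotating the whole structure rotates both r_ij and the frame, so the feature is unchanged:

```python
import numpy as np

def invariant_edge_feature(F, r_ij, n_basis=16, r_max=8.0, gamma=4.0):
    """Concatenate an RBF-encoded distance with the frame-projected unit direction."""
    d = np.linalg.norm(r_ij)
    centers = np.linspace(0.0, r_max, n_basis)
    rbf = np.exp(-gamma * (d - centers) ** 2)            # distance part (already invariant)
    direction = F @ (r_ij / (d + 1e-8))                  # direction expressed in the local frame
    return np.concatenate([rbf, direction])              # (n_basis + 3,)

F = np.eye(3)                                            # e.g., a dynamic frame from the previous sketch
print(invariant_edge_feature(F, np.array([1.0, 2.0, 2.0])).shape)   # (19,)
```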
  32. 33 Evaluations on the JARVIS-DFT 3D 2021 dataset

     Comparisons between dynamic frames and their static counterparts (weighted PCA vs PCA; max vs static local) show that dynamic frames outperform conventional static frames.
                                               E form.  E total.  BG (OPT)  BG (MBJ)  E hull
     Matformer (Yan et al., 2022)              0.0325   0.035     0.137     0.30      0.064
     PotNet (Lin et al., 2023)                 0.0294   0.032     0.127     0.27      0.055
     eComFormer (Yan et al., 2024)             0.0284   0.032     0.124     0.28      0.044
     iComFormer (Yan et al., 2024)             0.0272   0.0288    0.122     0.26      0.047
     Crystalformer (Taniai et al., 2024)       0.0306   0.0320    0.128     0.274     0.0463
     ─ w/ PCA frames (Duval et al., 2023)      0.0325   0.0334    0.144     0.292     0.0568
     ─ w/ lattice frames (Yan et al., 2024)    0.0302   0.0323    0.125     0.274     0.0531
     ─ w/ static local frames                  0.0285   0.0292    0.122     0.261     0.0444
     ─ w/ weighted PCA frames (proposed)       0.0287   0.0305    0.126     0.279     0.0444
     ─ w/ max frames (proposed)                0.0263   0.0279    0.117     0.242     0.0471
  33. 35 Neural structure fields (NeSF) for material decoding

     • Unlike point clouds of 3D surfaces in CV, decoding atomic systems is challenging due to their unknown and variable number of atoms. • We propose to represent point-based structures as continuous vector fields: given the latent code of a structure and an arbitrarily chosen 3D query point, the field outputs information about the nearest atom, namely a 3D displacement vector to it and a categorical distribution over atomic species (e.g., H, He, ...). Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).
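A minimal sketch of the field decoder's interface (my own illustration of the idea; the MLP sizes, latent dimension, and species count are hypothetical, not the authors' architecture): the decoder maps a structure latent and a 3D query point to a displacement toward the nearest atom and a distribution over species.

```python
import torch
import torch.nn as nn

class StructureFieldDecoder(nn.Module):
    """Map (structure latent z, 3D query point) -> (displacement to nearest atom, species logits)."""
    def __init__(self, latent_dim=128, hidden=256, n_species=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.displacement = nn.Linear(hidden, 3)          # vector pointing to the nearest atom
        self.species = nn.Linear(hidden, n_species)       # categorical logits (H, He, ...)

    def forward(self, z, query):
        h = self.mlp(torch.cat([z, query], dim=-1))
        return self.displacement(h), self.species(h)

decoder = StructureFieldDecoder()
z = torch.randn(32, 128)                                  # latents for a batch of structures
query = torch.rand(32, 3)                                 # arbitrary 3D query points
vec, logits = decoder(z, query)
print(vec.shape, logits.shape)                            # [32, 3], [32, 100]
```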
  34. 36 Field-based crystal autoencoders

     Field-based decoding achieves better reconstructions than conventional voxel-based decoding: voxel-based decoding often fails to reconstruct some atoms (producing too many or too few atoms), while the field-based decoder recovers the input structures more faithfully. Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).
  35. 37 CLaSP: CLIP-like multimodal learning for materials science

     • CV has fostered large-scale datasets of images with textual annotations (e.g., ImageNet, MS-COCO), enabling multimodal learning between text and images (CLIP, 2021). • Materials science lacks such resources, mainly due to the difficulty of crowdsourcing annotations. • Instead, we leverage a public database of 400k materials with publication metadata (titles and abstracts) to enable contrastive learning between text and structure (see the sketch below). Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
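For reference, a minimal sketch of the CLIP-style contrastive objective in this setting (a generic symmetric InfoNCE loss over paired crystal and text embeddings; the batch size, temperature, and placeholder embeddings are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(crystal_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (crystal, text) embeddings."""
    c = F.normalize(crystal_emb, dim=-1)                  # (B, D)
    t = F.normalize(text_emb, dim=-1)                     # (B, D)
    logits = c @ t.T / temperature                        # (B, B) similarity matrix
    targets = torch.arange(c.shape[0])                    # the i-th crystal pairs with the i-th text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with placeholder embeddings, e.g., from a crystal encoder and a text encoder.
loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```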
  36. 38 Application: text-based retrieval of crystal structures

     Key finding: literature-driven learning enables models to predict high-level functionalities of crystal structures. Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
  37. 39 Application: visualization of materials space

     In the resulting embedding space, structures with similar properties automatically form clusters (t-SNE visualization). Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also presented at the NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat workshops.
  38. 40 Summary of this talk Crystalformer [Taniai+, ICLR 2024] –

    A natural extension of standard transformers for periodic crystal structures – Distance-decay attention abstractly mimics energy calculations in physics – Fourier-space attention captures long-range interatomic interactions CrystalFramer [Ito & Taniai+, ICLR 2025] – Introduces dynamic frames derived from attention mechanisms to enhance expressive power Applications – Crystal encoders are immediately applicable to high-throughput property prediction – Serve as core components in embedding learning and multimodal foundation models Future directions – Extend to equivariant networks for structure and force-field prediction – Such networks are essential for generative modeling (e.g., diffusion models)