Machine Learning for Materials (Lecture 3)

Aron Walsh
Department of Materials, Centre for Processable Electronics
Machine Learning for Materials
3. Materials Data
Module MATE70026

Module Contents
1. Introduction
2. Machine Learning Basics
3. Materials Data
4. Crystal Representations
5. Classical Learning
6. Artificial Neural Networks
7. Building a Model from Scratch
8. Accelerated Discovery
9. Generative Artificial Intelligence
10. Recent Advances

Data-Driven Materials Research
Pettifor maps: a series of work on the structural classification of compounds and alloys. The maps can be used to quickly predict the structure types of new compositions.
D. G. Pettifor, Materials Science and Technology 4, 675 (1988)

Data-Driven Materials Research
Hand-built features: the Mendeleev number is used for efficient grouping of structure types (to capture periodic trends).
D. G. Pettifor, Materials Science and Technology 4, 675 (1988)

Data-Driven Materials Research
Structure-property correlations: connect crystal structure with measurable properties (mechanical, electronic, etc.). Early analysis was manual and often focused on linear relations with physics-informed features.
J. C. Phillips, Rev. Mod. Phys. 42, 317 (1970)

Data Representation
The choice of units or coordinate system can greatly impact model performance. More on this in the next class.

Class Outline
Materials Data
A. Data sources and formats
B. API queries

https://xkcd.com/1683/

Where to Find Data?
• Manual collection: go through papers, extract data and tabulate (takes time)
• Accelerated collection: use of natural language processing (requires model and workflow)
• Pre-built databases: excellent when they exist in your area (may require access fees)
• Automated experiments: generate your own data over a given parameter space (expensive)

Data Extraction from the Literature
Leverage the vast literature of published papers.
M. Schilling-Wilhelmi et al., Chem. Soc. Rev. (2025)

Data Extraction from the Literature
Many tailored workflows are available based on regular expressions and/or statistical models. Examples include https://github.com/mcs07/ChemDataExtractor and https://github.com/CederGroupHub/text-mined-synthesis_public
M. Schilling-Wilhelmi et al., Chem. Soc. Rev. (2025)
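As a minimal sketch of the regular-expression approach, the snippet below pulls numerical band gap values out of a sentence. The sentence and the pattern are invented for illustration; this is not how ChemDataExtractor or the text-mined synthesis workflow are implemented, and real papers need far more robust patterns.

```python
import re

# Hypothetical sentence, written for this example only.
TEXT = (
    "The film shows a band gap of 1.55 eV, "
    "while the doped sample has a band gap of 2.3 eV."
)

# Match "band gap of <number> eV" and capture the number.
PATTERN = re.compile(r"band gap of\s+(\d+(?:\.\d+)?)\s*eV", re.IGNORECASE)

# findall returns the captured group for each match, in order of appearance.
band_gaps = [float(value) for value in PATTERN.findall(TEXT)]
print(band_gaps)
```

A pattern like this only finds phrasings it was written for, which is one reason statistical models are often combined with rules.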

Why Share Data?
• Reproducibility: allow direct comparison with published literature beyond static tables and figures, e.g. raw spectra and diffraction patterns
• Reuse: facilitate meta-studies comparing results from multiple experiments, e.g. variation in UV-vis spectra for different samples
• Statistical models: the power of machine learning depends on the quantity, quality, and diversity of training data

Common Forms of Data Sharing
• Supporting information with publications: often in the form of static pdf files (increasingly obsolete)
• Data repositories: most institutions offer data upload portals, but these often lack guidelines and metadata, e.g. zip or tar files
• Community-specific repositories: the best option if available, usually in a common format and searchable, with error detection

Common Forms of Data Sharing
Many file types exist that differ in how data is structured, stored, and compressed, but all are easy to read. JSON is common as an open, flexible, and human-readable format.
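To illustrate why JSON counts as human-readable, here is a small sketch that writes a materials record to a JSON string and reads it back using the Python standard library. The field names are hypothetical and do not follow any community schema.

```python
import json

# Hypothetical record; field names are illustrative, not a standard schema.
record = {
    "formula": "NaCl",
    "spacegroup": "Fm-3m",
    "lattice_parameter_angstrom": 5.64,
    "source": "example entry for illustration",
}

text = json.dumps(record, indent=2)  # serialise to an indented, readable string
restored = json.loads(text)          # parse the string back into a dict

print(text)
```

The round trip returns the same dictionary, so the file can be inspected by eye and parsed by a program without specialist software.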

FAIR Data Standards
• Findable: discoverable by humans & machines with metadata & persistent identifiers (e.g. DOI)
• Accessible: archived in long-term storage with clear access terms (e.g. CC open license)
• Interoperable: exchangeable between different applications and systems using open file formats
• Reusable: well documented and curated with clear terms and conditions on usage
https://www.howtofair.dk/what-is-fair

Data Security
• Privacy: protection of personal data, e.g. General Data Protection Regulation (GDPR)
• Encryption: protocols for storage and transfer, e.g. public key encryption, hashing
• Access control: limiting users or computers, e.g. passwords, firewalls
• Data integrity: avoid corruption or modification, e.g. data provenance tracking, regular versioning
Not all databases are public, e.g. those held by companies and academic-industrial collaborations.
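One simple way to check data integrity is to store a hash of a file and compare it later. The sketch below uses SHA-256 from the Python standard library on in-memory bytes; the CSV content is made up for the example, and a real workflow would hash the file on disk and record the checksum alongside the dataset version.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative dataset contents (hypothetical file).
original = b"formula,band_gap_eV\nSi,1.12\n"
checksum = sha256_of_bytes(original)  # store this with the dataset

# Later: a copy with a single altered value.
tampered = b"formula,band_gap_eV\nSi,1.21\n"

unchanged = sha256_of_bytes(original) == checksum  # same bytes, same digest
modified = sha256_of_bytes(tampered) != checksum   # any change alters the digest

print(checksum, unchanged, modified)
```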

Crystallography in the Lead
Cambridge Structural Database (from 1960): 1 million structures in 2019.
Community databases with a standard format that is human and machine readable.
https://www.ccdc.cam.ac.uk and https://checkcif.iucr.org


Crystallography in the Lead
Many open-source programs for CIF visualisation (including Miller indices, diffraction patterns…)
VESTA software: https://jp-minerals.org/vesta/en/
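Because a CIF is plain text made of tag-value pairs, the unit cell can be read with a few lines of code. The sketch below is a deliberately minimal reader for the `_cell_` tags only, applied to an illustrative snippet; it ignores loops, multi-line values, and symmetry, so a dedicated CIF parser is the better choice for real files.

```python
# Illustrative CIF fragment (values in parentheses are standard uncertainties).
CIF_TEXT = """\
data_NaCl
_cell_length_a    5.6402(3)
_cell_length_b    5.6402(3)
_cell_length_c    5.6402(3)
_cell_angle_alpha 90
_cell_angle_beta  90
_cell_angle_gamma 90
"""


def read_cell(cif_text):
    """Collect the _cell_ tag-value pairs from CIF text as floats."""
    cell = {}
    for line in cif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("_cell_"):
            value = parts[1].split("(")[0]  # drop the uncertainty, e.g. "(3)"
            cell[parts[0]] = float(value)
    return cell


cell = read_cell(CIF_TEXT)
print(cell)
```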

Example: General Repository https://zenodo.org/record/7828687

Example: Community Repository https://nomad-lab.eu/nomad-lab

Example: Curated Repository Physical Sciences Data Service on https://psds.ac.uk

Example: Materials Modelling https://materialsproject.org

Example: Microscopy https://www.ebi.ac.uk/emdb/about

Example: NMR https://nmrshiftdb.nmr.uni-koeln.de

Class Outline
Materials Data
A. Data sources and formats
B. API queries

Database Access
• Web browser. Advantage: no knowledge of database software is required. Disadvantage: often one material at a time, so slow for large datasets.
• Data file. Advantage: all data is downloaded as one file (e.g. zip or tar). Disadvantage: specialist software is often needed; data is not up-to-date.
• API* (e.g. Python). Advantage: access the latest data with advanced queries. Disadvantage: some programming knowledge is required.
*API = Application Programming Interface
Tip: Keep a record of the database version you are using; data can change.

Materials Database Access: Python API https://www.optimade.org

Query – Optimade https://www.optimade.org
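An OPTIMADE query is an HTTP request with a filter string. The sketch below only builds the request URL for the structures endpoint and does not contact any server. The base URL is a placeholder, and the helper function name is hypothetical; substitute the address of a real provider listed on the OPTIMADE website.

```python
from urllib.parse import urlencode

# Placeholder address, not a real provider (assumption for illustration).
BASE_URL = "https://example.org/optimade"


def structures_url(base_url, elements, nelements, page_limit=10):
    """Build a URL for structures containing all given elements."""
    quoted = ", ".join(f'"{el}"' for el in elements)
    # OPTIMADE filter: required elements and total number of elements.
    filter_str = f"elements HAS ALL {quoted} AND nelements={nelements}"
    query = urlencode({"filter": filter_str, "page_limit": page_limit})
    return f"{base_url}/v1/structures?{query}"


# Binary compounds that contain both Si and O.
url = structures_url(BASE_URL, ["Si", "O"], 2)
print(url)
```

The same filter string should work across providers that implement the specification, which is the main appeal of a common API.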

Query – Materials Project (MPRester) https://github.com/materialsproject/api

Load a Dataset https://hackingmaterials.lbl.gov/matminer

Structure and Property Databases https://xkcd.com → http://cmx.io

Data Provenance
Projects can combine data from many sources. Provenance graphs are one way to link them, showing connections between structures, calculations, and data.
Example: graph for a project on 324 covalent organic frameworks.
https://www.aiida.net/sections/graph_gallery.html
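A provenance graph can be sketched as a dictionary that maps each item to the items it was derived from. The node names below are hypothetical and the code does not use the AiiDA API; it only shows how the full history of a result can be traced by walking the graph.

```python
# Hypothetical provenance links: each node lists what it was derived from.
PROVENANCE = {
    "band_gap": ["dft_calculation"],
    "dft_calculation": ["relaxed_structure", "pseudopotential"],
    "relaxed_structure": ["relaxation"],
    "relaxation": ["input_structure"],
    "input_structure": [],
    "pseudopotential": [],
}


def ancestors(node, graph):
    """Return every node that the given node depends on, directly or not."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


sources = ancestors("band_gap", PROVENANCE)
print(sorted(sources))
```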

Image Data
Images are widely used in materials science. The building blocks are pixels (e.g. 128×128). We will return to images in Lecture 6.
Greyscale: pixel ∈ [0, 255]
Colour: P_R, P_G, P_B ∈ [0, 255]
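In code, an image is just an array of numbers. The sketch below builds a 128×128 greyscale gradient and a colour version as nested Python lists, assuming 8-bit channels with integer values from 0 to 255. In practice an array library would be used, but plain lists keep the structure visible.

```python
WIDTH, HEIGHT = 128, 128

# Greyscale: one integer per pixel, brightening from left (0) to right (255).
grey = [[(x * 255) // (WIDTH - 1) for x in range(WIDTH)] for _ in range(HEIGHT)]

# Colour: one (R, G, B) triple per pixel, fading from blue to red.
colour = [[(value, 0, 255 - value) for value in row] for row in grey]

print(len(grey), len(grey[0]))     # number of rows and columns
print(grey[0][0], grey[0][-1])     # darkest and brightest pixel in the first row
print(colour[0][0], colour[0][-1])
```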

Knowledge Graphs
Structured representation of knowledge to model properties and their interrelations in a graph format.
Figure labels: Properties; Property relations & models
https://github.com/materialsintelligence/propnet

Class Outcomes
1. Describe the importance of materials data for research and development
2. Demonstrate an understanding of the types of data that are shared in the materials community
3. Perform simple queries using an application programming interface
Activity: Chemical space
