Machine Learning for Materials (Lecture 3)

Aron Walsh
January 29, 2024

Slides linked to https://github.com/aronwalsh/MLforMaterials. Updated for 2025.

Transcript

  1. Machine Learning for Materials: 3. Materials Data
    Aron Walsh, Department of Materials, Centre for Processable Electronics
    Module MATE70026
  2. Module Contents
    1. Introduction
    2. Machine Learning Basics
    3. Materials Data
    4. Crystal Representations
    5. Classical Learning
    6. Artificial Neural Networks
    7. Building a Model from Scratch
    8. Accelerated Discovery
    9. Generative Artificial Intelligence
    10. Recent Advances
  3. Data-Driven Materials Research
    Pettifor maps: a series of work on structural classification of compounds and alloys. Quickly predict the structure types of new compositions.
    D. G. Pettifor, Materials Science and Technology 4, 675 (1988)
  4. Data-Driven Materials Research
    Hand-built features: the Mendeleev number is used for efficient grouping of structure types (to capture periodic trends).
    D. G. Pettifor, Materials Science and Technology 4, 675 (1988)
  5. Data-Driven Materials Research
    Structure-property correlations: connect crystal structure with measurable properties (mechanical, electronic, etc.). Early analysis was manual and often focused on linear relations with physics-informed features.
    J. C. Phillips, Rev. Mod. Phys. 42, 317 (1970)
  6. Data Representation
    Choice of units or coordinate system can greatly impact model performance. More on this in the next class.
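
A minimal sketch of why the coordinate system matters. The two-ring dataset, the function names, and the threshold of 2.0 are illustrative assumptions, not taken from the slides. Points on two concentric rings overlap along the Cartesian x axis, but a single threshold on the polar radius separates them.

```python
import math

def ring(radius, n=8):
    """Return n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0)  # class 0 (toy data)
outer = ring(3.0)  # class 1 (toy data)

def to_polar(x, y):
    """Convert Cartesian (x, y) to polar (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)

def classify_by_radius(x, y, threshold=2.0):
    """Label a point using only its polar radius (threshold is an assumption)."""
    r, _ = to_polar(x, y)
    return int(r > threshold)

inner_labels = [classify_by_radius(x, y) for x, y in inner]
outer_labels = [classify_by_radius(x, y) for x, y in outer]

# In Cartesian x alone, the inner ring sits inside the range of the outer ring,
# so no single threshold on x can split the two classes.
inner_x = [x for x, _ in inner]
outer_x = [x for x, _ in outer]
x_ranges_overlap = min(outer_x) <= min(inner_x) and max(inner_x) <= max(outer_x)
```

The same data become linearly separable after a change of variables, which is the kind of effect a well-chosen representation can have on a model.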
  7. Where to Find Data?
    • Manual collection – go through papers, extract data and tabulate (takes time)
    • Accelerated collection – use of natural language processing (requires model and workflow)
    • Pre-built databases – excellent when they exist in your area (may require access fees)
    • Automated experiments – generate your own data over a given parameter space (expensive)
  8. Data Extraction from the Literature
    Leverage the vast literature of published papers.
    M. Schilling-Wilhelmi et al., Chem. Soc. Rev. (2025)
  9. Data Extraction from the Literature
    Many tailored workflows are available based on regular expressions and/or statistical models. Examples include https://github.com/mcs07/ChemDataExtractor and https://github.com/CederGroupHub/text-mined-synthesis_public
    M. Schilling-Wilhelmi et al., Chem. Soc. Rev. (2025)
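
A minimal sketch of the regular-expression approach mentioned above. The example sentence, the pattern, and the function name `extract_band_gaps` are illustrative assumptions; this is not the ChemDataExtractor API, and a real workflow would need far more robust patterns.

```python
import re

# Hypothetical pattern for phrases such as "Si has a band gap of 1.12 eV".
BAND_GAP = re.compile(
    r"(?P<material>[A-Z][A-Za-z0-9]*)\s+has\s+a\s+band\s?gap\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>eV|meV)"
)

def extract_band_gaps(text):
    """Return a list of {material, band_gap_eV} records found in the text."""
    records = []
    for match in BAND_GAP.finditer(text):
        value = float(match.group("value"))
        if match.group("unit") == "meV":
            value /= 1000.0  # convert meV to eV so all records share one unit
        records.append({"material": match.group("material"),
                        "band_gap_eV": value})
    return records

# Toy input text written for this example.
text = "Si has a band gap of 1.12 eV, whereas GaAs has a bandgap of 1420 meV."
records = extract_band_gaps(text)
```

Converting units at extraction time keeps the resulting table consistent, which matters once records from many papers are combined.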
  10. Why Share Data?
    • Reproducibility – allow direct comparison with published literature beyond static tables and figures, e.g. raw spectra and diffraction patterns
    • Reuse – facilitate meta-studies comparing results from multiple experiments, e.g. variation in UV-vis spectra for different samples
    • Statistical models – power of machine learning depends on the quantity, quality, and diversity of training data
  11. Common Forms of Data Sharing
    • Supporting information with publications – often in the form of static PDF files (increasingly obsolete)
    • Data repositories – most institutions offer data upload portals, but often lack guidelines and metadata, e.g. zip or tar files
    • Community-specific repositories – best option if available, usually in a common format and searchable, with error detection
  12. Common Forms of Data Sharing
    Many file types exist that differ in how data is structured, stored, and compressed, but all are easy to read. JSON is common as an open, flexible, and human-readable format.
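
A minimal sketch of writing and reading a JSON record with the Python standard library. The field names and the `source` entry are illustrative assumptions, not a community schema.

```python
import json

# A toy materials record; keys are hypothetical, chosen for readability.
record = {
    "formula": "NaCl",
    "space_group": "Fm-3m",
    "lattice_parameter_angstrom": 5.64,
    "source": {"database": "example-db", "version": "2024.1"},
}

# Serialise to human-readable text, then parse it back into Python objects.
text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(text)
```

Because the serialised form is plain text, it can be inspected in any editor and tracked with version control.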
  13. FAIR Data Standards
    • Findable: discoverable by humans & machines with metadata & persistent identifiers (e.g. DOI)
    • Accessible: archived in long-term storage with clear access terms (e.g. CC open license)
    • Interoperable: exchangeable between different applications and systems using open file formats
    • Reusable: well documented and curated with clear terms and conditions on usage
    https://www.howtofair.dk/what-is-fair
  14. Data Security
    • Privacy: protection of personal data, e.g. General Data Protection Regulation (GDPR)
    • Encryption: protocols for storage and transfer, e.g. public key encryption, hashing
    • Access control: limiting users or computers, e.g. passwords, firewalls
    • Data integrity: avoid corruption or modification, e.g. data provenance tracking, regular versioning
    Not all databases are public, e.g. companies and academic-industrial collaborations
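
A minimal sketch of a hash-based integrity check using the standard-library `hashlib` module. The byte strings and the helper name `sha256_checksum` are illustrative assumptions; a checksum detects modification but does not by itself provide access control or encryption.

```python
import hashlib

def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Toy dataset contents written for this example.
original = b"formula,band_gap_eV\nSi,1.12\n"
stored_checksum = sha256_checksum(original)  # published alongside the data

# Later: verify a downloaded copy against the stored checksum.
downloaded_ok = original
downloaded_bad = b"formula,band_gap_eV\nSi,1.21\n"  # two digits swapped

is_intact = sha256_checksum(downloaded_ok) == stored_checksum
is_corrupted = sha256_checksum(downloaded_bad) != stored_checksum
```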
  15. Crystallography in the Lead
    Cambridge Structural Database (from 1960) … 1 million structures (2019)
    Human and machine readable; community databases; standard format
    https://www.ccdc.cam.ac.uk and https://checkcif.iucr.org
  16. Crystallography in the Lead
    Many open-source programs for CIF visualisation (including Miller indices, diffraction patterns…)
    VESTA software: https://jp-minerals.org/vesta/en/
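
A minimal sketch of why a standard text format is machine readable: reading the unit-cell tags from a CIF-style snippet with the standard library only. The snippet and the helper `read_cell` are illustrative assumptions; this handles only simple `tag value` lines, and a dedicated CIF parser should be used for real files.

```python
import re

# Toy CIF-style text written for this example (cubic cell).
cif_text = """\
data_example
_cell_length_a    5.640(1)
_cell_length_b    5.640(1)
_cell_length_c    5.640(1)
_cell_angle_alpha 90
_cell_angle_beta  90
_cell_angle_gamma 90
"""

def read_cell(cif):
    """Collect the numeric _cell_ tags from simple 'tag value' lines."""
    cell = {}
    for line in cif.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("_cell_"):
            # Drop a trailing standard uncertainty such as the "(1)" in 5.640(1).
            cell[parts[0]] = float(re.sub(r"\(\d+\)$", "", parts[1]))
    return cell

cell = read_cell(cif_text)
angles = [cell[f"_cell_angle_{name}"] for name in ("alpha", "beta", "gamma")]
all_right_angles = all(angle == 90.0 for angle in angles)
```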
  17. Database Access
    • Web browser – Advantage: no knowledge of database software is required. Disadvantage: often one material at a time; slow for large datasets.
    • Data file – Advantage: all data is downloaded as one (e.g. zip or tar) file. Disadvantage: specialist software often needed; data is not up-to-date.
    • API* (e.g. Python) – Advantage: access latest data with advanced queries. Disadvantage: some programming knowledge required.
    *API = Application Programming Interface
    Tip: Keep a record of the database version you are using; data can change
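
A minimal sketch of preparing an API query in Python without contacting any server. It assumes an OPTIMADE-style filter string; the base URL is a placeholder, and `build_query` and the `query_log` fields are hypothetical names introduced here. The log illustrates the tip above about recording the database version alongside the query.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint (assumption): replace with the provider you actually use.
BASE_URL = "https://example.org/optimade/v1/structures"

def build_query(elements, page_limit=10):
    """Build a URL asking for structures containing exactly these elements."""
    quoted = ",".join(f'"{el}"' for el in elements)
    params = {
        "filter": f"elements HAS ALL {quoted} AND nelements={len(elements)}",
        "page_limit": page_limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query(["Ga", "As"])

# Keep a record of what was asked and which database version answered.
# The version is left empty here; fill it in from the server response.
query_log = {"url": url, "database_version": None}

# Decode the query string again to confirm the filter survived URL encoding.
decoded = parse_qs(urlparse(url).query)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so that the sketch runs offline.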
  18. Data Provenance
    Projects can combine data from many sources. Provenance graphs are one way to link them: connections between structures, calculations, and data.
    Graph for a project on 324 covalent organic frameworks: https://www.aiida.net/sections/graph_gallery.html
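
A minimal sketch of a provenance graph as a plain dictionary, with a traversal that lists everything a result depends on. The node names and the helper `ancestors` are illustrative assumptions; this is not the AiiDA data model or API.

```python
# Each key is a node; its list holds the nodes it was directly produced from.
provenance = {
    "band_gap": ["dft_calculation"],
    "dft_calculation": ["relaxed_structure", "dft_settings"],
    "relaxed_structure": ["relaxation"],
    "relaxation": ["cif_file", "dft_settings"],
    "cif_file": [],
    "dft_settings": [],
}

def ancestors(node, graph):
    """Return every node that the given node depends on, directly or not."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

band_gap_inputs = ancestors("band_gap", provenance)
```

Tracing back from `band_gap` recovers the input file and settings that would be needed to reproduce the value.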
  19. Image Data
    Images are widely used in materials science. The building blocks are pixels (e.g. 128×128). We will return to images in Lecture 6.
    Greyscale: Pixel ∈ [0,255]
    Colour: P_R, P_G, P_B ∈ [0,255]
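
A minimal sketch of pixel data as nested lists, assuming 8-bit channels stored as integers from 0 to 255. The tiny 4×4 gradient, the helper `to_grey`, and the use of the common luminance weights (0.299, 0.587, 0.114) are illustrative choices, not taken from the slides.

```python
width = height = 4

# Greyscale image: one integer per pixel, here a diagonal gradient from 0 to 255.
grey = [[(x + y) * 255 // (width + height - 2) for x in range(width)]
        for y in range(height)]

def to_grey(pixel):
    """Convert an (R, G, B) pixel to one greyscale value via luminance weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

white = to_grey((255, 255, 255))
black = to_grey((0, 0, 0))

# Scale to the range 0 to 1, a common preprocessing step before model training.
normalised = [[value / 255 for value in row] for row in grey]
```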
  20. Knowledge Graphs
    Structured representation of knowledge to model properties and their interrelations in a graph format: properties, property relations & models.
    https://github.com/materialsintelligence/propnet
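
A minimal sketch of properties linked by models, using standard isotropic elasticity relations. The data layout, the helper `derive`, and the input values (in GPa) are illustrative assumptions; this is not the propnet API.

```python
# Each relation maps input properties to one output property via a model.
relations = [
    # Listed first on purpose: it needs outputs of the two relations below.
    {"inputs": ("youngs_modulus", "poisson_ratio"), "output": "p_wave_modulus",
     "model": lambda E, nu: E * (1 - nu) / ((1 + nu) * (1 - 2 * nu))},
    {"inputs": ("bulk_modulus", "shear_modulus"), "output": "youngs_modulus",
     "model": lambda K, G: 9 * K * G / (3 * K + G)},
    {"inputs": ("bulk_modulus", "shear_modulus"), "output": "poisson_ratio",
     "model": lambda K, G: (3 * K - 2 * G) / (2 * (3 * K + G))},
]

def derive(known, relations):
    """Apply relations repeatedly until no new property can be derived."""
    props = dict(known)
    changed = True
    while changed:
        changed = False
        for rel in relations:
            ready = all(name in props for name in rel["inputs"])
            if ready and rel["output"] not in props:
                args = (props[name] for name in rel["inputs"])
                props[rel["output"]] = rel["model"](*args)
                changed = True
    return props

# Toy input values in GPa, chosen to give round numbers.
properties = derive({"bulk_modulus": 100, "shear_modulus": 60}, relations)
```

Two measured properties are enough to reach a third derived one through the chain, which is the benefit of storing relations alongside the data.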
  21. Class Outcomes
    1. Describe the importance of materials data for research and development
    2. Demonstrate an understanding of the types of data that are shared in the materials community
    3. Perform simple queries using an application programming interface
    Activity: Chemical space