structural classification of compounds and alloys
Quickly predict the structure types of new compositions
D. G. Pettifor, Materials Science and Technology 4, 675 (1988)
properties (mechanical, electronic, etc.)
Early analysis was manual and often focused on linear relations with physics-informed features
J. C. Phillips, Rev. Mod. Phys. 42, 317 (1970)
papers, extract data and tabulate (takes time)
• Accelerated collection – use of natural language processing (requires model and workflow)
• Pre-built databases – excellent when they exist in your area (may require access fees)
• Automated experiments – generate your own data over a given parameter space (expensive)
Soc. Rev. (2025)
Examples include https://github.com/mcs07/ChemDataExtractor and https://github.com/CederGroupHub/text-mined-synthesis_public
Many tailored workflows are available based on regular expressions and/or statistical models
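As a minimal sketch of the regular-expression route, the snippet below pulls band gap values out of a few sentences. The sentences, the pattern, and the function name are illustrative assumptions, not taken from the tools linked above; real workflows need far more robust parsing.

```python
import re

# Hypothetical sentences standing in for text pulled from papers.
SENTENCES = [
    "The band gap of TiO2 was measured to be 3.2 eV.",
    "A band gap of 1.12 eV is typical for Si.",
    "The sample was annealed at 450 K for two hours.",
]

# Match "band gap", then the first number followed by "eV",
# without skipping past a full stop that is not inside the number.
BAND_GAP = re.compile(r"band gap\b[^.]*?(\d+(?:\.\d+)?)\s*eV", re.IGNORECASE)


def extract_band_gaps(sentences):
    """Return (value_in_eV, sentence) pairs for sentences reporting a band gap."""
    records = []
    for sentence in sentences:
        match = BAND_GAP.search(sentence)
        if match:
            records.append((float(match.group(1)), sentence))
    return records


records = extract_band_gaps(SENTENCES)
for value, sentence in records:
    print(f"{value:.2f} eV <- {sentence}")
```

A pattern like this is brittle (it misses other units, ranges, and values split across sentences), which is one reason statistical models are often used alongside or instead of regular expressions.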
published literature beyond static tables and figures, e.g. raw spectra and diffraction patterns
• Reuse – facilitate meta-studies comparing results from multiple experiments, e.g. variation in UV-vis spectra for different samples
• Statistical models – power of machine learning depends on the quantity, quality, and diversity of training data
– often in the form of static PDF files (increasingly obsolete)
• Data repositories – most institutions offer data upload portals, but often lack guidelines and metadata, e.g. zip or tar files
• Community-specific repositories – best option if available, usually in a common format and searchable, with error detection
machines with metadata & persistent identifiers (e.g. DOI)
• Accessible: archived in long-term storage with clear access terms (e.g. CC open license)
• Interoperable: exchangeable between different applications and systems using open file formats
• Reusable: well documented and curated with clear terms and conditions on usage
Data Protection Regulation (GDPR)
• Encryption: protocols for storage and transfer, e.g. public key encryption, hashing
• Access control: limiting users or computers, e.g. passwords, firewalls
• Data integrity: avoid corruption or modification, e.g. data provenance tracking, regular versioning
Not all databases are public, e.g. those held by companies and academic-industrial collaborations
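One simple way to check data integrity is to record a cryptographic hash when a file is archived and compare it later. The sketch below uses SHA-256 from the Python standard library; the file name and contents are made up for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spectrum.csv")  # hypothetical data file
    with open(path, "w") as handle:
        handle.write("wavelength_nm,absorbance\n400,0.12\n")
    recorded = sha256_of_file(path)  # store this alongside the data

    # Simulate a later modification of the file.
    with open(path, "a") as handle:
        handle.write("410,0.15\n")
    current = sha256_of_file(path)

unchanged = recorded == current
print("File unchanged since archiving:", unchanged)
```

A mismatch only shows that the file differs from the archived version; it does not say what changed, which is where provenance tracking and versioning come in.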
database software is required | Often one material at a time – slow for large datasets
Data file | All data is downloaded as one (e.g. zip or tar) file | Specialist software often needed; data is not up-to-date
API* (e.g. Python) | Access latest data with advanced queries | Some programming knowledge required
*API = Application Programming Interface
Tip: Keep a record of the database version you are using; data can change
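To illustrate the API route, here is a sketch that builds a query URL, assuming the database exposes an OPTIMADE-style REST endpoint (OPTIMADE is a common API specification for materials databases, but is not named on the slide). The base URL is a placeholder and no request is sent; substitute the address of a real provider.

```python
from urllib.parse import urlencode

# Placeholder address: replace with the base URL of a real provider.
BASE_URL = "https://example.org/optimade/v1"


def build_structures_query(elements, n_elements=None, page_limit=10):
    """Build a URL that filters structures by the elements they contain."""
    quoted = ", ".join(f'"{element}"' for element in elements)
    filter_expr = f"elements HAS ALL {quoted}"
    if n_elements is not None:
        filter_expr += f" AND nelements={n_elements}"
    params = {"filter": filter_expr, "page_limit": page_limit}
    return f"{BASE_URL}/structures?{urlencode(params)}"


# Binary compounds containing both Ga and N.
url = build_structures_query(["Ga", "N"], n_elements=2)

# Keep a record of what was asked and of which database version answered.
query_record = {
    "query_url": url,
    "database_version": "fill in from the provider",  # placeholder value
}
print(query_record["query_url"])
```

Storing the query next to the database version makes it possible to repeat the request later and to explain any differences in the returned data.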
graphs are one way to link them
https://www.aiida.net/sections/graph_gallery.html
Connections between structures, calculations, and data
Graph for a project on 324 covalent organic frameworks
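The idea of linking structures, calculations, and data in a graph can be sketched in plain Python. This is not the AiiDA API; the node names are invented, and each edge points from an input to whatever was derived from it.

```python
from collections import deque

# Hypothetical provenance edges: (input, derived output).
EDGES = [
    ("structure:COF-A", "calculation:relaxation"),
    ("calculation:relaxation", "structure:COF-A-relaxed"),
    ("structure:COF-A-relaxed", "calculation:band-structure"),
    ("calculation:band-structure", "data:band-gap"),
]


def build_parents(edges):
    """Map each node to the list of nodes it was directly derived from."""
    parents = {}
    for source, target in edges:
        parents.setdefault(target, []).append(source)
    return parents


def ancestors(node, parents):
    """Return everything upstream of a node, nearest first (breadth-first)."""
    seen = []
    queue = deque(parents.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


lineage = ancestors("data:band-gap", build_parents(EDGES))
for step in lineage:
    print(step)
```

Walking the graph backwards from a result recovers the chain of calculations and the starting structure that produced it.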
research and development
2. Demonstrate an understanding of the types of data that are shared in the materials community
3. Perform simple queries using an application programming interface
Activity: Chemical space