ICLR 2024

From Molecules to Materials

Pre-training Large Generalizable Models for Atomic Property Prediction

¹FAIR at Meta, ²Carnegie Mellon University

*Corresponding author

t-SNE visualization of JMP embeddings across 10 datasets, showing unified atomic representations.

Abstract

Atomic property prediction plays a central role in computational chemistry and materials science. However, developing accurate and efficient predictive models remains challenging because of the complexity of atomic interactions and the large amounts of data needed to train robust models.

We propose Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that simultaneously trains on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task learning framework.

Our combined pre-training dataset consists of approximately 120M systems from OC20, OC22, ANI-1x, and Transition-1x. We benchmark JMP on a diverse set of downstream tasks across small molecules, large molecules, and materials, evaluating fine-tuned models on QM9, rMD17, MD22, SPICE, MatBench, and QMOF datasets.

JMP achieves state-of-the-art results on 34 out of 40 downstream benchmarks. In addition, our work highlights the value of pre-training strategies that utilize diverse data to achieve generalizable representations in atomic property prediction.
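Schematically, the joint pre-training objective treats each source dataset as its own task and sums their energy and force losses; the weights λ_d and λ_F below are illustrative placeholders, not the exact balancing scheme reported in the paper.

\mathcal{L}_{\text{JMP}} \;=\; \sum_{d \,\in\, \{\text{OC20},\ \text{OC22},\ \text{ANI-1x},\ \text{Transition-1x}\}} \lambda_d \left( \mathcal{L}^{(d)}_{\text{energy}} + \lambda_F\, \mathcal{L}^{(d)}_{\text{forces}} \right)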

Joint Multi-domain Pre-training


Figure 1: Overview of JMP showing joint pre-training on diverse datasets followed by efficient fine-tuning.

Pre-training

JMP pre-trains on approximately 120 million systems from four diverse datasets, learning unified atomic representations across molecular and materials domains.

OC20: Catalyst relaxation trajectories (~100M systems)
OC22: Oxide catalyst relaxations (~8M systems)
ANI-1x: Small molecule MD simulations (~2M systems)
Transition-1x: Reaction transition states (~10M systems)
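The pre-training stage can be pictured as a standard multi-task loop: one shared backbone, one output head per dataset, and every optimization step drawing a batch from each source. The sketch below is a simplified illustration under those assumptions; Backbone, TaskHead, and make_loader are stand-ins rather than the released JMP code, which builds on a GemNet-OC backbone.

import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins only: the actual JMP backbone is a GemNet-OC graph network
# with dedicated energy/force outputs; simple MLPs are used here to keep the sketch short.
class Backbone(nn.Module):
    def __init__(self, in_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, atom_feats):            # (n_atoms, in_dim) -> (n_atoms, hidden)
        return self.net(atom_feats)

class TaskHead(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.energy = nn.Linear(hidden, 1)    # per-structure energy (summed over atoms)
        self.forces = nn.Linear(hidden, 3)    # per-atom force vector

    def forward(self, h):
        return self.energy(h).sum(dim=0), self.forces(h)

datasets = ("oc20", "oc22", "ani1x", "transition1x")
backbone = Backbone()
heads = nn.ModuleDict({name: TaskHead() for name in datasets})  # one head per pre-training task
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(heads.parameters()), lr=2e-4)

# make_loader is a hypothetical helper yielding (atom_feats, energy, forces) batches;
# smaller datasets are cycled so that every step draws a batch from every task.
loaders = {name: itertools.cycle(make_loader(name)) for name in datasets}

for step in range(100_000):
    optimizer.zero_grad()
    total_loss = 0.0
    for name in datasets:
        atom_feats, energy, forces = next(loaders[name])
        h = backbone(atom_feats)              # shared representation across all domains
        e_pred, f_pred = heads[name](h)       # dataset-specific prediction head
        total_loss = total_loss + F.l1_loss(e_pred, energy) + F.l1_loss(f_pred, forces)
    total_loss.backward()
    optimizer.step()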

Fine-tuning

Pre-trained JMP models can be efficiently fine-tuned on downstream tasks with minimal data, achieving state-of-the-art on 34/40 benchmarks.

Small molecules: QM9, rMD17 (2 datasets)
Large molecules: MD22, SPICE (2 datasets)
Materials: MatBench, QMOF (2 datasets)
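Fine-tuning follows the usual transfer-learning recipe: load the pre-trained backbone weights, attach a freshly initialized head for the downstream target, and train on the much smaller downstream dataset. The sketch below reuses the illustrative Backbone and TaskHead classes from the pre-training sketch; the checkpoint path and make_loader helper are placeholders, not a released interface.

import torch
import torch.nn.functional as F

# Load pre-trained backbone weights (path is a placeholder) and attach a new head
# for the downstream task; the head starts from random initialization.
backbone = Backbone()
backbone.load_state_dict(torch.load("jmp_backbone_pretrained.pt"))
head = TaskHead()

# Using a smaller learning rate for the pre-trained trunk than for the fresh head is a
# common fine-tuning choice; the exact schedule used in the paper may differ.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-4},
])

for epoch in range(50):
    for atom_feats, energy, forces in make_loader("rmd17_aspirin"):   # placeholder loader
        optimizer.zero_grad()
        e_pred, f_pred = head(backbone(atom_feats))
        loss = F.l1_loss(e_pred, energy) + F.l1_loss(f_pred, forces)
        loss.backward()
        optimizer.step()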

Benchmark Results

JMP achieves state-of-the-art results across diverse benchmarks spanning small molecules, large molecules, and materials.


QM9 benchmark results comparing JMP to prior methods.

Key Results

Pre-training Improves Downstream Performance

JMP pre-training yields an average improvement of 59% across downstream tasks compared to training from scratch.


Pre-training Enables Larger Models

Pre-training allows effective scaling to larger models (JMP-L, 235M parameters) that consistently outperform smaller variants.


Pre-training Reduces Computational Cost

JMP significantly reduces the GPU hours required for downstream fine-tuning compared to training from scratch.


Per-Layer t-SNE Visualization

Watch how atomic representations evolve through the layers of the pre-trained JMP model, progressively learning more discriminative features.


Layer-by-layer evolution: Atomic embeddings become increasingly domain-specific through deeper layers of the network.
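A per-layer view like the one above can be produced by collecting atomic embeddings after each backbone block (for example, with forward hooks) and projecting each layer's embeddings to 2-D with t-SNE. The snippet below is a minimal sketch under that assumption; per_layer_embeddings is a hypothetical dict of layer index to (n_atoms, hidden_dim) arrays, not the script behind the figure.

import numpy as np
from sklearn.manifold import TSNE

# per_layer_embeddings: assumed dict {layer_index: array of shape (n_atoms, hidden_dim)},
# gathered while running batches drawn from the 10 downstream datasets.
projections = {}
for layer, emb in per_layer_embeddings.items():
    projections[layer] = TSNE(n_components=2, perplexity=30,
                              init="pca", random_state=0).fit_transform(np.asarray(emb))

# Scatter-plot each 2-D projection and color points by source dataset to see how
# domain structure becomes more pronounced in deeper layers.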

Video Presentation

Watch the ICLR 2024 presentation to learn more about JMP.

Citation

@inproceedings{shoghi2024molecules,
  title={From Molecules to Materials: Pre-training Large Generalizable
         Models for Atomic Property Prediction},
  author={Shoghi, Nima and Kolluru, Adeesh and Kitchin, John R and
          Ulissi, Zachary W and Zitnick, C Lawrence and Wood, Brandon M},
  booktitle={The Twelfth International Conference on Learning
             Representations},
  year={2024}
}