Creating Ultra-Accurate Molecular Models by Applying Machine Learning Techniques

XSEDE Resources Provide the GPU Capability to Develop New Models

A team led by researchers at UC San Diego's Department of Chemistry and Biochemistry and the San Diego Supercomputer Center (SDSC) has pioneered the use of machine learning techniques to develop models for simulations of Earth's most critical compound – water – that can be extended to other generic molecules with what researchers call "unprecedented accuracy."

Their work, published recently in The Journal of Chemical Physics, demonstrates how popular machine learning techniques can be used to construct predictive molecular models based on quantum mechanical reference data. Molecular simulations using modern high-performance computing systems such as the ones provided via XSEDE are key to the rational design of novel materials with applications ranging from fuel cells to water purification systems, atmospheric climate models, and computational drug design.

"This is a new methodology that could revolutionize computational chemistry," noted SDSC Director Michael Norman, who is also the principal investigator for the Comet supercomputer, an XSEDE resource based at SDSC.

The team relied on the GPU-computing power and capabilities provided by Comet as well as Maverick, based at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Access to both systems was allocated through the Extreme Science and Engineering Discovery Environment (XSEDE), an NSF-funded program under which scientists can interactively share computing resources, data, and expertise.

"Although computer simulations have become a powerful tool for the modeling of water and for molecular sciences in general, they are still limited by a tradeoff between the accuracy of the molecular models and the associated computational cost," said Francesco Paesani, professor of chemistry and biochemistry at UC San Diego and the study's principal investigator.

"Now that we've proved this concept with a model of water using machine learning techniques, we are currently extending this novel approach to generic molecules," added Paesani. "Scientists will be able to predict the properties of molecules and materials with unprecedented accuracy."

Researchers used that term because these new models are more accurate than classical force fields that are currently used for molecular simulations, explained Andreas W. Goetz, a research scientist who directed the work at SDSC. Now researchers can make quantitative predictions of the properties of water, for example, where other models fail.

The new study builds on the successful "MB-pol many-body potential" for water developed in Paesani's lab, which has recently emerged as an accurate molecular model for water simulations across the gas, liquid, and solid phases.

As reported in the paper, the researchers investigated the performance of three machine learning techniques – permutationally invariant polynomials, neural networks, and Gaussian approximation potentials – in representing many-body interactions in water. Machine learning typically involves 'training' a computer or robot on millions of actions so that, over time, it learns how to derive insight and meaning from the data.
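To make the first of these techniques more concrete, the following is a minimal sketch of a single permutationally invariant basis function for one water molecule. It is an illustration only, not the implementation used in the study: the exponential distance variable, the `alpha` value, the geometry, and all function names are assumptions chosen for the example. The idea shown is that a monomial in the interatomic distances is averaged over every exchange of identical atoms, so the result does not depend on how the two hydrogens are labeled.

```python
import itertools
import math

def distance_variables(coords, alpha=1.0):
    """Map each interatomic distance r_ij to exp(-alpha * r_ij) (assumed form)."""
    n = len(coords)
    return {
        (i, j): math.exp(-alpha * math.dist(coords[i], coords[j]))
        for i in range(n) for j in range(i + 1, n)
    }

def symmetrized_monomial(coords, exponents, identical_groups, alpha=1.0):
    """Average a monomial over all permutations of identical atoms.

    exponents: dict mapping an atom pair (i, j), i < j, to an integer power.
    identical_groups: list of index lists; atoms in one list are interchangeable.
    """
    n = len(coords)
    per_group = [list(itertools.permutations(g)) for g in identical_groups]
    total, count = 0.0, 0
    for combo in itertools.product(*per_group):
        # Build the full relabeling of atoms for this combination.
        perm = list(range(n))
        for group, image in zip(identical_groups, combo):
            for src, dst in zip(group, image):
                perm[src] = dst
        permuted = [coords[perm[k]] for k in range(n)]
        x = distance_variables(permuted, alpha)
        term = 1.0
        for pair, power in exponents.items():
            term *= x[pair] ** power
        total += term
        count += 1
    return total / count

# Atom order: O, H, H. Illustrative geometry, not reference data.
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
relabeled = [water[0], water[2], water[1]]  # same molecule, hydrogens swapped
exponents = {(0, 1): 2, (1, 2): 1}

value = symmetrized_monomial(water, exponents, [[1, 2]])
value_relabeled = symmetrized_monomial(relabeled, exponents, [[1, 2]])
```

Here `value` and `value_relabeled` agree, which is the invariance property that gives the technique its name. A full model would combine many such terms with fitted coefficients.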

In the quantum world, all three methods have been consistently equivalent in reproducing large datasets involving the interaction of multiple particles – many-body phenomena such as two-body and three-body energies – as well as water cluster interaction energies, all with great accuracy.
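The two-body and three-body energies mentioned above come from the standard many-body expansion, in which the energy of a cluster is written as a sum of single-molecule energies, pair corrections, triple corrections, and so on. The sketch below shows that bookkeeping for a trimer. The `toy_energy` function and its numbers are made up for illustration and stand in for a quantum mechanical calculation; the function names are likewise hypothetical.

```python
from itertools import combinations

def many_body_terms(fragments, energy):
    """Return (one_body, two_body, three_body) energy dicts.

    fragments: list of hashable labels, one per molecule.
    energy: callable taking a tuple of labels and returning the total
            energy of that sub-cluster (a stand-in for reference data).
    """
    e1 = {(i,): energy((i,)) for i in fragments}
    e2 = {}
    for i, j in combinations(fragments, 2):
        # Pair energy minus the two isolated-molecule energies.
        e2[(i, j)] = energy((i, j)) - e1[(i,)] - e1[(j,)]
    e3 = {}
    for i, j, k in combinations(fragments, 3):
        pair_sum = e2[(i, j)] + e2[(i, k)] + e2[(j, k)]
        mono_sum = e1[(i,)] + e1[(j,)] + e1[(k,)]
        # Whatever the trimer energy contains beyond monomers and pairs.
        e3[(i, j, k)] = energy((i, j, k)) - pair_sum - mono_sum
    return e1, e2, e3

def toy_energy(cluster):
    """Made-up energies: -1.0 per molecule, -0.2 per pair, -0.05 per triple."""
    n = len(cluster)
    pairs = n * (n - 1) // 2
    triples = n * (n - 1) * (n - 2) // 6
    return -1.0 * n - 0.2 * pairs - 0.05 * triples

e1, e2, e3 = many_body_terms(["w1", "w2", "w3"], toy_energy)
trimer_total = sum(e1.values()) + sum(e2.values()) + sum(e3.values())
```

For a trimer the expansion is exact once the three-body term is included, so `trimer_total` reproduces `toy_energy(("w1", "w2", "w3"))`. In a machine-learned model, each of these terms would be predicted by a fitted function rather than looked up from a reference calculation.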

"We have demonstrated that these different machine learning techniques can effectively be employed to encode the highly complex quantum mechanical many-body interactions that arise when molecules interact," said Thuong Nguyen, lead author of the study and a research scholar at UC San Diego when the research was conducted.

GPUs and Complex Neural Networks

As for future efforts, these findings are important not only because the models are highly accurate, but also because they mean researchers can choose the algorithms that best map to the available hardware, according to SDSC's Goetz.

"Modern many-core processors, for instance, are well-suited to evaluate the complex expressions of the permutationally invariant polynomials, while massively parallel graphics processing units (GPUs) perform exceptionally well for neural networks," he said.

The development of complex neural networks with associated optimization processes was performed on Comet and Maverick via XSEDE allocations. "Currently, there are not many GPU resources conveniently available," said Goetz, a long-time user of XSEDE resources. "So we are grateful for XSEDE for providing access to such systems in a manner that saved us both time and expense, in turn accelerating our time to published results."

Also participating in the study were researchers at the École Polytechnique Fédérale de Lausanne in Switzerland, Cambridge University in England, and the University of Göttingen in Germany. This research was supported by the National Science Foundation through grant # ACI-1642336.

ABOUT XSEDE

The Extreme Science and Engineering Discovery Environment (XSEDE) is an NSF-funded single virtual organization which connects scientists with advanced digital resources. People around the world use these resources and services – things like supercomputers, collections of data, expert staff and new tools – to improve our planet. XSEDE is supported by the National Science Foundation through award ACI-1053575.


Machine learning techniques predict quantum mechanical many-body interactions in water. Shown is an example using neural networks for a water trimer (top left) from a simulation of liquid water (top right). Molecular descriptors encode the structural environment around oxygen atoms (red) and hydrogen atoms (white). When used as input for neural networks (blue boxes for oxygen, orange boxes for hydrogen), many-body energies can be calculated accurately. Credit: Andreas Goetz and Thuong Nguyen, SDSC/UC San Diego
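The scheme described in the caption, with descriptors computed per atom and fed to a separate network for each element, can be sketched roughly as follows. This is a toy example under stated assumptions, not the study's model: the Gaussian radial descriptor with a cosine cutoff, the `etas` and `cutoff` values, the network size, and the random untrained weights are all assumptions, and a single water molecule is used instead of a trimer to keep the example short.

```python
import math
import random

def radial_descriptor(index, coords, etas=(0.5, 1.0, 2.0), cutoff=6.0):
    """Describe one atom's environment with Gaussian-weighted distance sums."""
    features = []
    for eta in etas:
        g = 0.0
        for j, other in enumerate(coords):
            if j == index:
                continue
            r = math.dist(coords[index], other)
            if r < cutoff:
                # Smooth cutoff so contributions vanish at the cutoff radius.
                fc = 0.5 * (math.cos(math.pi * r / cutoff) + 1.0)
                g += math.exp(-eta * r * r) * fc
        features.append(g)
    return features

def make_network(n_in, n_hidden, rng):
    """Random, untrained weights for a one-hidden-layer network."""
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return w1, b1, w2

def atomic_energy(network, features):
    """Evaluate one element-specific network on one atom's descriptor."""
    w1, b1, w2 = network
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))

def total_energy(elements, coords, networks):
    """Sum per-atom outputs, using the network that matches each element."""
    return sum(atomic_energy(networks[el], radial_descriptor(i, coords))
               for i, el in enumerate(elements))

rng = random.Random(0)
networks = {"O": make_network(3, 4, rng), "H": make_network(3, 4, rng)}
elements = ["O", "H", "H"]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

energy = total_energy(elements, coords, networks)
# Relabeling the two hydrogens leaves the predicted energy unchanged.
energy_swapped = total_energy(elements, [coords[0], coords[2], coords[1]], networks)
```

Because both hydrogens share one network and the descriptor depends only on distances, the output is the same however identical atoms are labeled. Training would adjust the weights so that the summed output matches reference many-body energies.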