
Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems, and Predictions

✍ Scribed by Edward O. Pyzer-Knapp (editor), Teodoro Laino (editor)


Publisher: American Chemical Society
Year: 2020
Tongue: English
Leaves: 140
Series: ACS Symposium Series
Category: Library


✦ Synopsis


Artificial intelligence, and especially its application to chemistry, is a rapidly expanding area of research. This volume presents groundbreaking work in the field, both to engage active researchers and to give newcomers a solid base from which to enter it. Its interdisciplinary coverage will be a valuable resource for those working in cheminformatics, physical chemistry, and computational chemistry.

✦ Table of Contents


Preface
1. Atomic-Scale Representation and Statistical Learning of Tensorial Properties
Introduction
Figure 1. Structural descriptors should identify unequivocally and concisely the geometry and composition of a molecule or condensed phase.
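The invariances this caption alludes to can be illustrated with a toy descriptor: the sorted list of interatomic distances, which is unchanged under translation, rotation, and atom reordering. This is only a minimal sketch with hypothetical names, far simpler than the SOAP-style descriptors discussed in the chapter:

```python
import itertools
import math

def sorted_distance_descriptor(coords):
    """Toy structural descriptor: the sorted list of all interatomic
    distances. It is invariant to rigid translation, rotation, and
    permutation of the atoms (illustrative only; real descriptors
    such as SOAP are far richer)."""
    return sorted(math.dist(p, q) for p, q in itertools.combinations(coords, 2))

# The descriptor is unchanged by a rigid translation of the molecule.
mol = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in mol]
assert sorted_distance_descriptor(mol) == sorted_distance_descriptor(shifted)
```

Note that such a distance list concisely encodes geometry but, unlike the descriptors in the chapter, carries no composition information.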
Linear Regression
Tensors, Symmetries and Correlations
Translations
Rotations
Reflections
Covariant Descriptors
Figure 2. Provided that one can define a local reference system, it is possible to learn tensorial properties by aligning each molecule (or environment) into a fixed reference frame.
Covariant Regression
Figure 3. Representation of the reciprocal alignment between water environments.
Spherical Representation
SOAP Representation
λ-SOAP(1) Representation
λ-SOAP(2) Representation
Non-linearity
Implementation
Examples
Dielectric Response Series
Figure 4. Learning curves of the Zundel cation dielectric response series µ, α, and β, decomposed into their anisotropic (λ > 0) spherical tensor components. Full and dashed lines refer to predictions carried out with λ-SOAP kernel functions that are covariant in SO(3) and O(3), respectively.
Electronic Charge Densities
Figure 5. Learning curves of the predicted charge density of 200 randomly selected butane molecules, when considering up to 800 reference molecules to train the model. The molecular geometries and computational details are the same as in Grisafi et al. (ref 18). The black full line refers to the prediction error as reported in Grisafi et al. (ref 18). Blue lines refer to the result obtained with the RI-cc-pV5Z basis, both with a λ-SOAP(2) descriptor covariant in SO(3) (full) and O(3) (dashed). Dotted lines refer to the basis set error. In both cases, 100 reference atomic environments have been used to define the problem dimensionality.
Conclusions
Acknowledgments
References
2. Prediction of Mohs Hardness with Machine Learning Methods Using Compositional Features
Introduction
Methods
Datasets
Classes
Figure 1. Histogram of the Mohs hardnesses of the datasets of (a) 622 naturally occurring minerals and (b) 52 artificially grown single crystals.
Features
ML Models
RFs and Gini Feature Importance
SVMs
Study Models
Grid Optimization for Binary and Ternary SVC: Models 1, 2, 8, and 9
Evaluation Criteria
Results and Discussion
Grid Optimization Results for Binary and Ternary SVCs: Models 1, 2, 8, and 9
Figure 2. The cross-validation grid optimization accuracies for the (a) binary and (b) ternary RBF SVCs, Models 1 and 2, respectively, and for the (c) binary and (d) ternary Matérn SVCs, Models 8 and 9, respectively. The y axis is the value of C, the soft-margin cost parameter. The x axis is the value of γ, the width parameter of the Gaussian RBF. The color represents the model accuracy.
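The grid scan this caption describes varies the soft-margin parameter C and the Gaussian RBF parameter γ, where k(x, x′) = exp(−γ‖x − x′‖²), scoring each pair by cross-validated accuracy. A minimal sketch of the kernel and a log-spaced (C, γ) grid, purely for illustration (the chapter itself uses standard SVC tooling; the grid bounds here are made up):

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# A log-spaced (C, gamma) grid like the one scanned in the figure;
# in the actual study every pair is scored by cross-validated accuracy.
C_grid = [10.0 ** k for k in range(-2, 3)]       # 0.01 ... 100
gamma_grid = [10.0 ** k for k in range(-3, 2)]   # 0.001 ... 10
grid = [(C, g) for C in C_grid for g in gamma_grid]

print(len(grid))                       # 25 (C, gamma) pairs to evaluate
print(rbf_kernel([0, 0], [0, 0], 1.0))  # 1.0 at zero distance
```

The kernel decays smoothly with distance, so larger γ values make the classifier more local, which is why the accuracy map in the figure varies along both axes.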
Figure 3. Performance of models trained under 500 stratified train-test splits. (a) Workflow of the model performance evaluation. (b) The specificity and recall scores for all binary models. (c) The recall scores for ternary models. (d) The specificity scores for ternary models. The bar height corresponds to the average respective score; the black error bars correspond to the standard deviation of the respective metric.
Model Performance on Naturally Occurring Minerals Dataset
Figure 4. ROC plots using the false positive ratio and false negative ratio are used to evaluate the ability of classifiers to predict Mohs hardness values. (a) Comparison of binary RBF SVC, RF, Matérn SVC, Models 1, 3, and 8, respectively. (b) Comparison of the ternary OVR classifiers, Models 5, 6, and 7, which predict low, medium, and high hardness values, respectively.
Feature Importances
Figure 5. The Gini feature importances of a 10,000 tree RF for (a) binary (Model 3) and ternary multiclass (Model 4) models and (b) OVR binary RF classifiers (Models 5–7) with low, medium, and high hardness ternary class as a positive class, respectively.
Model Validation with Validation Set
Figure 6. (a) Workflow of performance testing for model validation. (b) The specificity and recall scores for all binary models. (c) The recall scores for ternary models. (d) The specificity scores for ternary models. The bar height corresponds to the average respective score. All models were trained on the naturally occurring mineral dataset and validated with the artificial single crystal dataset.
Considerations
Conclusions
Acknowledgments
References
3. High-Dimensional Neural Network Potentials for Atomistic Simulations
Introduction
HDNNPs
Overview
Symmetry Functions as Descriptors of the Local Chemical Environment
Construction of HDNNP
Case Studies: NaOH Solutions and the ZnO/Water Interface
Aqueous NaOH Solutions
The ZnO/Water Interface
Limitations and Strengths of HDNNPs
Summary
Acknowledgments
References
4. Data-Driven Learning Systems for Chemical Reaction Prediction: An Analysis of Recent Approaches
Introduction
Figure 1. The number of publications with the topic "Organic Chemistry" on Web of Science from 1924 to 2019, binned over 5 years (ref 19).
Representation and Formats
Chemical Reaction Data
Figure 2. A Bromo Grignard reaction with nontrivial atom-mapping: any phenyl group in the product could correspond to any phenyl in the reactants, so there is more than one correct atom-mapping. The reaction SMILES for this reaction is O=C(c1ccccc1)c1ccccc1.Br[Mg]c1ccccc1>>OC(c1ccccc1)(c1ccccc1)c1ccccc1.
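Reaction SMILES like the one in the caption follow the reactants>agents>products convention, with "." separating molecules and ">>" indicating an empty agents field. A minimal stand-alone parser for illustration (a cheminformatics toolkit such as RDKit would normally handle this, including validation the sketch omits):

```python
def split_reaction_smiles(rxn):
    """Split a reaction SMILES 'reactants>agents>products' (or the
    'reactants>>products' shorthand) into three lists of molecule
    SMILES. Naive sketch: it does not validate the SMILES themselves."""
    reactants, agents, products = (
        part.split(".") if part else [] for part in rxn.split(">")
    )
    return reactants, agents, products

rxn = "O=C(c1ccccc1)c1ccccc1.Br[Mg]c1ccccc1>>OC(c1ccccc1)(c1ccccc1)c1ccccc1"
reactants, agents, products = split_reaction_smiles(rxn)
print(len(reactants), len(agents), len(products))  # 2 0 1
```

Splitting on ">" yields exactly three fields (the middle one empty for ">>"), which is why the unpacking above works for both forms.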
Figure 3. The USPTO dataset family tree with the two versions of the text-mined dataset and the different filtered subsets used to benchmark data-driven reaction prediction models (refs 26, 29, 35, 36).
Reaction Prediction Approaches
Figure 4. Timeline of the recent developments of large-scale data-driven reaction prediction models that can be compared using the different USPTO reaction subsets. There are two main strategies: bond changes predictions and product molecule generation.
Figure 5. Visualization of the two chemical reaction representation settings. a) The separate-reagents setting, in which the information about which molecules contribute atoms to the product and which do not is explicitly contained. Unfortunately, this requires knowing the atom-mapping, and therefore the product, before making the prediction. b) In contrast, the mixed setting makes no distinction between reactants and reagents; the model must figure out by itself which molecules are most likely to react together. The mixed setting makes the reaction prediction problem more realistic, but also more challenging.
Figure 6. Product–reactants attention generated by the Molecular Transformer for a Bromo Suzuki coupling reaction (ref 32). Attention weights show how important an input token (horizontal) was for the prediction of an output token (vertical). The model focused on the corresponding molecule parts in the reactants while predicting the product.
Conclusion and Outlook
References
5. Using Machine Learning To Inform Decisions in Drug Discovery: An Industry Perspective
Introduction and Scope
Figure 1. The technology hype cycle.
Screening for Hit and Tool Molecules
Figure 2. Typical plate patterns observed in an HTS experiment, which can be corrected algorithmically. Wells that contain apparent hits are colored red, with clear gradient (left) and edge (right) effects on the plates where false positives serve to obscure true signals in the data.
Figure 3. Comparison between a conventional image analysis pipeline and an approach based on multiscale CNN. (a) Starting from the raw image data, a conventional pipeline workflow carries out a series of independent data analysis steps that culminates in a prediction for the phenotype classes. Each step involves method customization, as well as parameter adjustments. (b) The multiscale CNN approach instead classifies the raw image data into phenotypes in one unbiased and automatic step. The parameters in the approach correspond to the weights of the neural network, and these are automatically optimized based on training images. Reproduced with permission from reference 20. Copyright 2017 Oxford University Press.
Computational Approaches for Hit and Tool Molecules
Figure 4. An AL screening cycle. The experiment is initiated with a starting set of compounds (maybe a diverse set) to test. An ML model is built from the results and is used to select the next compounds to test on the basis of finding more active compounds (exploitation) or of finding information that will help the model to learn (exploration).
Figure 5. The classic DMT cycle for iterative product design, as used in drug discovery, annotated with methods used for chemical structure ideation and learning as introduced in Reker and Schneider (ref 32).
Lead Optimization
Figure 6. The chemist-centric lead optimization cycle. All data analysis, molecule design ideation, and prediction is performed or coordinated by one or more medicinal chemists in the team, who also need to decide which designs to make next.
Figure 7. Interpretative QSAR contributions using the universal approach for structural interpretation of QSAR and quantitative structure–property relationship models and the Simplex representation of molecular structure method for structure decomposition (refs 61, 62). This analysis method is deployed at GSK and can be applied to any ML model. Reproduced with permission from reference 61. Copyright 2013 John Wiley & Sons.
Figure 8. An example of efficient exploration of novel chemical space using 3D information. Adapted with permission from reference 75. Copyright 2002 American Chemical Society.
Synthetic Tractability
Risk Management
Mechanistic Models vs ML
The Human–Machine Interface
Future Challenges
References
6. Cognitive Materials Discovery and Onset of the 5th Discovery Paradigm
Introduction
Figure 1. Relevant paradigms of discovery. A) Fundamentally, the components of discovery remain the same; however, their relative importance varies. B) Dominance of generalizable first-principles models rooted in differential equations, supported by advances in personal and high-performance computing, is characteristic of the 3rd discovery paradigm; C) a shift towards data-centric models rooted in statistical learning is an attribute of the 4th paradigm; D) elimination of the bottlenecks in the information/knowledge flows between components of discovery, and fusion enabled by domain-specific AI, is perceived as a signature of the 5th paradigm.
Onset of the 5th Paradigm: Bottlenecks
Data Availability Bottleneck
Cognition Bottleneck
Actionability Bottleneck
Figure 2. Volume of publications in the materials science domain employing different computational approaches, and volume of patents describing inventions in the polymer materials domain (refs 18, 19). A) Query A, "material" AND ("machine learning" OR "deep learning" OR "artificial intelligence") (blue dots), captures the contribution of the methods associated with the 4th discovery paradigm; query B, "material" AND "density functional theory" (orange dots), captures the contribution of the methods associated with the 3rd discovery paradigm. B) The ratio of the number of publications in query B to query A for each year shows that research activities are still dominated by the methods of the 3rd paradigm. C) Query C, ("polymer material") AND reaction_count:([2 TO 200]), captures the volume of inventions in the area of polymer materials synthesis. D) Year-to-year differential volume of inventions based on Figure 2C data. Queries A and B offer a proxy for the amount of labor, and query C serves as a proxy for the production output. Considering the year-to-year differential volume of inventions as a proxy for the marginal product, and given the monotonic growth of the publications volume, we notice that inventorship in the polymer materials space regularly encounters periods of diminishing returns and doesn't appear to be affected by the transition from the 3rd to the 4th discovery paradigm.
Discovery Bottlenecks in Material Science
To the 5th Paradigm via Cognitive Systems
Figure 3. Example of knowledge extraction in the polymer materials domain with IBM Polymer Annotator leveraging System T capabilities. The extraction emphasizes not only tagging the entities but recognizing their role in the text and establishing relations. The quality metrics of the annotator performance are shown in the bottom panel.
Figure 4. Example of a basic ontology for the polymer materials domain. Extended relations include impact of a chemical entity on a property, such as control of elongation at break via addition of soft-segment polymer. The described ontology captures both the cases when detailed information about the chemical entity is available (e.g., the formulation is explicitly described) and the cases when the chemical entity is represented as a “bag of components” (e.g., polymer is described as a backbone with a variety of pendant groups).
Figure 5. Binary label propagation on a similarity network of chemical entities. Nodes are chemical entities (e.g., polymer materials). The nodes are connected via edges if an operationally defined measure of similarity of the respective entities exceeds some threshold motivated by the task at hand. Red represents an “irrelevant” class, blue represents a “relevant” class, and grey color represents an unknown class in the beginning of the process, and an unknown/uncertain class at the end of the process. A) Initial label assignment, the seeding labels acquired from human SMEs or other suitable sources. As the labels propagate along the edges, they become progressively less reliable due to the dissimilarity of linked entities and separation of the labeled and unlabeled entities on the network. B) Label assignment in the end of the process. The stopping criteria depend on the problem at hand and the label propagation algorithm. The entities with low confidence of the inferred labels are in grey color. The cutoff for the confidence of the label assignment is defined operationally. In practice, the final grey labeled nodes are often interesting, as they represent the frontier interface between known “relevant” and “irrelevant” materials.
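The propagation scheme in this caption can be sketched as a simple iterative majority vote over labeled neighbors. This is a toy version under assumed conventions (hard labels, strict-majority updates), not the chapter's actual algorithm; tied or unreachable nodes stay grey, mirroring the frontier nodes the caption highlights:

```python
def propagate_labels(edges, seeds, n_iter=10):
    """Hard binary label propagation on an undirected similarity network.
    edges: dict node -> set of neighbor nodes; seeds: dict node -> label
    ('relevant' or 'irrelevant'). Unlabeled nodes adopt the strict-majority
    label among their labeled neighbors; ties and isolated nodes remain
    unlabeled (grey). Toy sketch only, not a production algorithm."""
    labels = dict(seeds)
    for _ in range(n_iter):
        updates = {}
        for node, nbrs in edges.items():
            if node in labels:
                continue  # already-assigned labels stay fixed here
            votes = [labels[n] for n in nbrs if n in labels]
            if not votes:
                continue
            top = max(set(votes), key=votes.count)
            if votes.count(top) * 2 > len(votes):  # strict majority only
                updates[node] = top
        if not updates:
            break  # converged: no node gained a confident label
        labels.update(updates)
    return labels

# Chain a-b-c-d-e with opposing seed labels at the two ends.
edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
seeds = {"a": "relevant", "e": "irrelevant"}
labels = propagate_labels(edges, seeds)
# b -> relevant, d -> irrelevant; c is tied between the two and stays grey.
```

The node left grey (c) sits exactly at the interface between the "relevant" and "irrelevant" regions, which is the frontier the caption describes as often being the most interesting.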
Conclusion
Acknowledgments
References
Editors’ Biographies
Edward O. Pyzer-Knapp
Teodoro Laino
Indexes
Author Index
Subject Index


📜 SIMILAR VOLUMES


Machine Learning Algorithms Popular algo
✍ Giuseppe Bonaccorso 📂 Library 📅 2018 🏛 Packt 🌐 English

Machine learning has gained tremendous popularity for its powerful and fast predictions with large datasets. However, the true forces behind its powerful output are the complex algorithms involving substantial statistical analysis that churn large datasets and generate substantial insight. This s

Machine Learning and Security: Protectin
✍ Clarence Chio; David Freeman 📂 Library 📅 2018 🏛 O’Reilly Media 🌐 English

Can machine learning techniques solve our computer security problems and finally put an end to the cat-and-mouse game between attackers and defenders? Or is this hope merely hype? Now you can dive into the science and answer this question for yourself. With this practical guide, you'll explore ways


Data-Driven Science and Engineering: Mac
✍ Steven L. Brunton, J. Nathan Kutz 📂 Library 📅 2019 🏛 Cambridge University Press 🌐 English

Data-driven discovery is revolutionizing the modeling, prediction, and control of complex systems. This textbook brings together machine learning, engineering mathematics, and mathematical physics to integrate modeling and control of dynamical systems with modern methods in data science. It highligh