Abstract
Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both reaction-representation learning and molecule generation, which allows for a more holistic approach. Inspired by the mechanisms of organic chemistry, we develop a new pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. Because it encodes chemical knowledge, our generative framework overcomes the limitations of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a noteworthy step toward a large-scale deep-learning framework for a variety of reaction-based applications.
Data availability
The USPTO MIT dataset was downloaded from the official GitHub repository (https://github.com/wengong-jin/nips17-rexgen) and the Schneider datasets were downloaded from the Supplementary Information of the original article (ref. 9; https://pubs.acs.org/doi/suppl/10.1021/ci5006614/suppl_file/ci5006614_si_002.zip). We provide our processed training data in Python pickle format at https://doi.org/10.5281/zenodo.8075066 (ref. 42).
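As a minimal sketch of how a pickle file of this kind could be opened in Python, the example below writes a small stand-in file and reads it back. The file name `demo_reactions.pkl`, the helper functions and the record layout are assumptions made for illustration; they are not the file names or schema of the Zenodo archive, which should be checked against the repository documentation.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled Python object from `path`.

    Only unpickle files from a source you trust: pickle can execute
    arbitrary code while loading.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Return a short description of the loaded object's type and size."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return {"type": type(obj).__name__, "size": size}


# Stand-in for a downloaded file. The record layout is invented for this
# demo and is NOT the schema of the published training data.
demo_records = [
    {"reactants": "CCO.CC(=O)O", "product": "CC(=O)OCC"},
    {"reactants": "c1ccccc1Br.OB(O)c1ccccc1", "product": "c1ccc(-c2ccccc2)cc1"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_reactions.pkl"  # hypothetical file name
    with open(demo_path, "wb") as handle:
        pickle.dump(demo_records, handle)
    loaded = load_pickle(demo_path)
    summary = summarize(loaded)
```

Inspecting the type and size of the loaded object first is a cheap way to learn the actual structure of a downloaded file before writing any processing code against it.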
Code availability
The code to reproduce the results and Python scripts to reproduce the training data are publicly available at https://github.com/qiangbo1222/Uni-RXN-official (ref. 43).
References
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Hendrycks, D. et al. Pretrained transformers improve out-of-distribution robustness. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 2744–2751 (Association for Computational Linguistics, 2020).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).
Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).
Lowe, D. Chemical reactions from US patents (1976-Sep2016). figshare https://doi.org/10.6084/m9.figshare.5104873.v1 (2017).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).
Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91–97 (2022).
Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3, 144–152 (2021).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pretrained transformer for computational chemistry. Mach. Learn. Sci. Technol. 3, 015022 (2022).
Wen, M., Blau, S. M., Xie, X., Dwaraknath, S. & Persson, K. A. Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining. Chem. Sci. 13, 1446–1458 (2022).
Wang, H. et al. International Conference on Learning Representations (ICLR, 2022).
NameRXN (Nextmove Software, 2021); http://www.nextmovesoftware.com/namerxn.html
Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. Sci. Technol. 2, 015016 (2021).
Korovina, K. et al. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proc. 23rd International Conference on Artificial Intelligence and Statistics (eds Chiappa, S. & Calandra, R.) 3393–3403 (PMLR, 2020).
Button, A., Merk, D., Hiss, J. A. & Schneider, G. Automated de novo molecular design by hybrid machine intelligence and rule-driven chemical synthesis. Nat. Mach. Intell. 1, 307–315 (2019).
Gao, W., Mercado, R. & Coley, C. W. International Conference on Learning Representations (ICLR, 2022).
Noh, J. et al. Path-aware and structure-preserving generation of synthetically accessible molecules. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16952–16968 (PMLR, 2022).
Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).
Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler–Lehman network. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 2604–2613 (Curran Associates Inc., 2017).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. A model to search for synthesizable molecules. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 7937–7949 (Curran Associates Inc., 2019).
Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. Barking up the right tree: an approach to search over molecule synthesis DAGs. Adv. Neural Inf. Process. Syst. 33, 6852–6866 (2020).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).
Genheden, S., Engkvist, O. & Bjerrum, E. J. A quick policy to filter reactions based on feasibility in AI-guided retrosynthetic planning. Preprint at chemRxiv https://doi.org/10.26434/chemrxiv.13280495.v1 (2020).
Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, 1074–1082 (2018).
Fialková, V. et al. LibINVENT: reaction-based generative scaffold decoration for in silico library design. J. Chem. Inf. Model. 62, 2046–2063 (2021).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Thakkar, A., Chadimová, V., Bjerrum, E. J., Engkvist, O. & Reymond, J.-L. Retrosynthetic accessibility score (RAscore)–rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem. Sci. 12, 3339–3349 (2021).
Morris, A. et al. Discovery of SARS-CoV-2 main protease inhibitors using a synthesis-directed de novo design model. Chem. Commun. 57, 5909–5912 (2021).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran Associates Inc., 2017).
Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).
Zhang, L., Xu, D., Arnab, A. & Torr, P. H. Dynamic graph message passing networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 3726–3735 (2020).
Jacob, P.-M. & Lapkin, A. Statistics of the network of organic chemistry. React. Chem. Eng. 3, 102–118 (2018).
Vignac, C. & Frossard, P. International Conference on Learning Representations (ICLR, 2022).
Chen, S. & Jung, Y. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nat. Mach. Intell. 4, 772–780 (2022).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Friesner, R. A. et al. Extra precision Glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49, 6177–6196 (2006).
Qiang, B. Processed training data for ‘Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model’. Zenodo https://doi.org/10.5281/zenodo.8075067 (2023).
Qiang, B. qiangbo1222/Uni-RXN-official V1.0. Zenodo https://doi.org/10.5281/zenodo.8113249 (2020).
Reymond Group: DRFP. GitHub https://github.com/reymond-group/drfp (2023).
Acknowledgements
This work was financially supported by the National Key R&D Programme of China (grant no. 2022YFF1203003 (Z.L.) and grant no. 2022YFC2303700 (L.Z.)), the Beijing AI Health Cultivation Project (grant no. Z221100003522022 (Z.L.)), the Peking University Health Science and StoneWise Technology Joint Laboratory Project (grant no. L202107 (Z.L.)) and the Open Fund of the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, China (grant no. KF-202304 (Z.L.)).
Author information
Contributions
B.Q. conceived the initial idea for the project. B.Q. and Y.D. processed the dataset and trained the model. B.H. provided support on computing resources. B.Q. and Y.Z. performed the experiments using the pretrained model and the generative model. Y.Z. analysed the results and B.Q. wrote the manuscript. B.Q., S.S., L.Z. and Z.L. contributed to the revision of the manuscript. The project was supervised by L.Z. and Z.L. All authors participated in discussions.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Esben Jannik Bjerrum and Thomas Blaschke for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Related works and details of experiments and implementation.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qiang, B., Zhou, Y., Ding, Y. et al. Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model. Nat Mach Intell 5, 1476–1485 (2023). https://doi.org/10.1038/s42256-023-00764-9
DOI: https://doi.org/10.1038/s42256-023-00764-9