Extraction of information about the molecule structure directly from GC-MS data
Abstract
Gas chromatography-mass spectrometry (GC-MS) is an important method of chemical analysis. It can be used for non-target analysis and for preliminary screening of completely unknown compounds. Electron ionization is the ionization method most commonly used in GC-MS. Some structural information can be extracted directly from GC-MS data using machine learning. In several previous works, machine learning models predicted the presence or absence of given substructures in a molecule directly from its electron ionization mass spectrum. Additional data such as molecular weight and retention index are rarely used together with the mass spectrum as input features of such models, and no systematic comparison of how much such data improve prediction accuracy has been conducted previously. In this work, gradient boosting was used to predict the presence or absence of given substructures in a molecule. The following substructures were considered: aromatic ring, 5-membered aromatic ring, 6-membered aromatic ring without heteroatoms (benzene ring), nitrogen-containing aromatic ring, primary, secondary, and tertiary amino groups, and nitrile, hydroxyl, carbonyl, methoxy, methyl, and carboxyl groups. Three types of additional features were used: molecular weight together with the neutral loss spectrum (which can be computed once the molecular weight is known), retention index for the non-polar stationary phase, and retention index for the polar stationary phase. A total of 8 feature sets were considered. In most cases, the molecular weight and neutral loss spectrum considerably improve the accuracy. Retention indices increase the accuracy further, and their effect is greatest for polar functional groups such as carbonyl and hydroxyl. The best accuracy is achieved when retention indices for both stationary phases are used.
The best prediction accuracy was achieved for the benzene ring and the aromatic ring; the worst (but still high) accuracy was observed for the secondary amino group. The achieved accuracy was compared with previously published results. In addition to the classification tasks, regression tasks were considered: gradient boosting models were developed that predict the number of aromatic atoms, methyl groups, and benzene rings. The additional features considerably improve the accuracy in this case as well. The regression models underestimate the number of occurrences when that number is high.
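The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The unit-mass binning, the m/z limit `MAX_MZ`, the base-peak normalization, the use of an integer (nominal) molecular weight, and all function and block names are assumptions made for illustration. The sketch also assumes that the 8 feature sets are all combinations of the three optional feature types (2^3 = 8), which the abstract implies but does not state explicitly.

```python
from itertools import combinations

MAX_MZ = 500  # assumed upper m/z bound of the feature vectors (hypothetical)

def spectrum_vector(peaks, max_mz=MAX_MZ):
    """Unit-mass intensity vector, scaled so that the base peak equals 1."""
    vec = [0.0] * (max_mz + 1)
    for mz, intensity in peaks:
        if 0 <= mz <= max_mz:
            vec[mz] = max(vec[mz], float(intensity))
    top = max(vec)
    return [v / top for v in vec] if top > 0 else vec

def neutral_loss_vector(peaks, mol_weight, max_mz=MAX_MZ):
    """Re-index every peak by the mass lost from the molecular ion (M - m/z)."""
    losses = [(mol_weight - mz, intensity)
              for mz, intensity in peaks if mol_weight - mz >= 0]
    return spectrum_vector(losses, max_mz)

# The three optional feature types named in the abstract (labels are hypothetical).
OPTIONAL_BLOCKS = ("mw_and_neutral_loss", "ri_nonpolar", "ri_polar")

def feature_sets():
    """All subsets of the three optional blocks: 2**3 = 8 feature sets."""
    return [combo
            for size in range(len(OPTIONAL_BLOCKS) + 1)
            for combo in combinations(OPTIONAL_BLOCKS, size)]

def build_features(peaks, mol_weight, ri_nonpolar, ri_polar, blocks):
    """Mass spectrum plus whichever optional blocks the feature set includes."""
    features = spectrum_vector(peaks)
    if "mw_and_neutral_loss" in blocks:
        features += [float(mol_weight)] + neutral_loss_vector(peaks, mol_weight)
    if "ri_nonpolar" in blocks:
        features.append(float(ri_nonpolar))
    if "ri_polar" in blocks:
        features.append(float(ri_polar))
    return features
```

Each resulting vector could then be passed to any gradient boosting classifier (substructure present or absent) or regressor (number of occurrences). The empty combination corresponds to the baseline that uses the mass spectrum alone.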