Nature is an inexhaustible source of new therapeutics, and humans have used it to heal themselves for ages. Shaped by evolution, plants, bacteria, fungi, and other living organisms contain hundreds of potentially bioactive metabolites, many of which remain to be discovered. Today, Ultra-High Performance Liquid Chromatography (UHPLC) coupled with High-Resolution tandem Mass Spectrometry (HRMS/MS) is the standard method to chemically profile these organisms and gain insight into their composition. Thanks to modern data processing workflows and annotation strategies, large amounts of spectral and structural data can be obtained and linked to biological screening results for each natural extract. In this thesis project, we aimed to develop innovative metabolomics workflows that use UHPLC-HRMS/MS data and biological screening results from large collections of natural extracts to identify bioactive compounds in mixtures. The developed approach should help focus isolation efforts on valuable natural products (NPs). As an application case, we used metabolomics and anti-trypanosomatid activity data (against Trypanosoma cruzi, T. brucei, and Leishmania donovani) obtained on a dataset of 1,600 plant extracts.
We developed a novel MS/MS-based sample vectorization method, called MEMO, to highlight chemical similarities among extracts in large chemodiverse libraries without needing any retention time (RT) based feature-alignment step. Because such an alignment makes the iterative addition of new samples difficult, we developed a Python-based framework that processes each sample independently: it automatically organizes the fragmentation spectra of the LC-MS features through molecular networking and annotates them with state-of-the-art annotation tools. While such a sample-centric workflow eases the iterative addition of samples by avoiding the need to recompute post-feature-alignment annotation steps, it hinders the direct comparison of samples based on the relative intensity of LC-MS features across samples. To maintain a way to compare unaligned samples, the standardized data generated at the sample scale were integrated into a single Experimental Natural Product Knowledge Graph (ENPKG). Such a format allows the integration of both chemical and bioactivity data and the connection of experimental data to other graph databases such as Wikidata and ChEMBL. Using the SPARQL language to query the data, we demonstrated that such an organization is an efficient strategy to link and interpret heterogeneous chemical and biological data in order to annotate bioactive compounds in extracts before their isolation. To confirm the generated bioactivity annotations, we isolated and characterized some metabolites of interest from different extracts using semi-preparative HPLC, NMR, and chiroptical methods.
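The idea of comparing samples from their MS/MS content alone, without RT-based alignment, can be illustrated with a minimal Python sketch. This is not the MEMO implementation: the function names, the encoding of fragment peaks and neutral losses as text "words", the 0.1 Da rounding, and the cosine comparison are all assumptions chosen for illustration. Each sample is reduced to a count of how many of its spectra contain each word, so a new sample can be vectorized on its own and compared with existing ones.

```python
from collections import Counter
import math


def spectrum_to_words(precursor_mz, peaks, decimals=1):
    """Encode one MS/MS spectrum as 'peak@' and 'loss@' words (assumed encoding)."""
    words = []
    for mz, _intensity in peaks:
        words.append(f"peak@{round(mz, decimals):.{decimals}f}")
        loss = precursor_mz - mz
        if loss > 0:
            # Neutral loss relative to the precursor ion
            words.append(f"loss@{round(loss, decimals):.{decimals}f}")
    return words


def sample_fingerprint(spectra):
    """Count, for one sample, how many spectra contain each word.

    `spectra` is a list of (precursor_mz, [(mz, intensity), ...]) tuples.
    Retention time is never used, so no cross-sample alignment is needed.
    """
    counts = Counter()
    for precursor_mz, peaks in spectra:
        counts.update(set(spectrum_to_words(precursor_mz, peaks)))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count fingerprints."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: two extracts sharing a spectrum, one unrelated extract
extract_a = [(300.1, [(150.0, 100.0), (120.0, 50.0)])]
extract_b = [(300.1, [(150.0, 80.0), (120.0, 60.0)])]
extract_c = [(500.2, [(77.0, 10.0)])]

fp_a = sample_fingerprint(extract_a)
fp_b = sample_fingerprint(extract_b)
fp_c = sample_fingerprint(extract_c)

print(cosine_similarity(fp_a, fp_b))
print(cosine_similarity(fp_a, fp_c))
```

Because each fingerprint depends only on the sample's own spectra, adding a new extract to the collection requires computing one new vector rather than realigning the whole dataset.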
Because they do not require any RT-based feature alignment step and allow the easy addition of new samples over time, the MEMO method and the ENPKG workflow represent sustainable data exploitation and data management strategies in NP research. By integrating crosslinks to other databases, the ENPKG workflow helps prevent data from being lost in hermetic silos of information. Because of their versatility and integrative character, we believe that the developed tools should contribute to advancing the knowledge of Nature's fascinating chemistry.