Preprint at ChemRxiv
Abstract
Our unsupervised clustering technique, VOCCluster, prototyped in Python, handles features of deconvolved GC-MS breath data. VOCCluster was created from a heuristic ontology based on observing experts as they processed data with a suite of software packages. VOCCluster identifies volatile organic compounds (VOCs) in deconvolved GC-MS breath data and groups those with similar mass spectra and retention index profiles.
VOCCluster was used to cluster more than 15,000 features extracted from 74 GC-MS clinical breath samples obtained from participants with cancer before and after radiation therapy. VOCCluster clustered those features into 1081 groups (including endogenous and exogenous compounds and instrumental artifacts) with an accuracy rate of 96% (±0.04 at 95% confidence interval). Results were evaluated against a panel of ground truth compounds and compared with other clustering methods used in previous metabolomics studies, such as DBSCAN and OPTICS.
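To make the idea of grouping features by similar mass spectra and retention index concrete, the following is a minimal sketch only, not the VOCCluster algorithm itself. It assumes each feature is a (retention index, spectrum) pair, that spectra are compared by cosine similarity, and that a feature joins the first group whose representative lies within a retention index tolerance. The function names, the `ri_tol` and `min_sim` parameters, and the toy data are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    dot = sum(v * b.get(mz, 0.0) for mz, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_features(features, ri_tol=10.0, min_sim=0.9):
    """Greedy grouping (illustrative assumption, not the published method).

    features: list of (retention_index, spectrum) tuples.
    Returns a list of clusters, each a list of feature indices; the first
    index in each cluster acts as its representative.
    """
    clusters = []
    for i, (ri, spec) in enumerate(features):
        for cluster in clusters:
            rep_ri, rep_spec = features[cluster[0]]
            close_in_ri = abs(ri - rep_ri) <= ri_tol
            if close_in_ri and cosine_similarity(spec, rep_spec) >= min_sim:
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([i])
    return clusters


# Toy features (hypothetical values, not taken from the study).
features = [
    (1000.0, {43: 100.0, 58: 40.0}),
    (1003.0, {43: 95.0, 58: 42.0}),   # similar spectrum, close retention index
    (1002.0, {91: 100.0, 92: 60.0}),  # close retention index, different spectrum
    (1400.0, {43: 100.0, 58: 40.0}),  # same spectrum, distant retention index
]
clusters = group_features(features)
print(clusters)
```

In this toy input only the first two features share a group, because both criteria must hold at once: a matching spectrum alone or a matching retention index alone is not enough. Density-based methods such as DBSCAN and OPTICS instead take a distance metric and neighbourhood parameters, which is one reason a direct comparison against a ground truth panel is informative.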