Feature Selection in Microbiome Bioinformatics: Part 1
22nd January, 2024
Have you ever encountered the challenge of dealing with an abundance of features or taxa in your data? Is your machine learning model showing signs of overfitting to the training data? Feature selection could be the solution to your problem.
In clinical microbiome bioinformatics, we often have few samples, typically 100 or fewer, while the number of features can be considerably larger. Some machine learning algorithms struggle with this type of data, while others prove more robust. The logical solution is to select the features that are most informative for the prediction task at hand. However, doing this correctly can be tricky.
Feature Selection Approaches
When building a machine learning model, engineers must be vigilant at every step to keep the training and testing sets separate. In certain microbiome studies, researchers take the entire dataset and run a statistical test on each feature to determine whether its abundance differs significantly according to the target variable. They then build a machine learning model using only the selected features. While this may seem like a viable approach, it violates one of the fundamental rules of machine learning model building: when features are selected using the entire dataset, information from the testing fold leaks into the training fold. Feature selection should always occur exclusively within the training data. Otherwise the held-out samples have already influenced the model, and without a separate validation dataset you cannot reliably estimate its performance.
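To make the leakage concrete, here is a minimal sketch in Python with scikit-learn. Everything in it is an assumption chosen for illustration and comes from none of the studies discussed here: the data are pure random noise (so no feature carries real signal), the per-feature test is an ANOVA F-test (`f_classif`), the number of kept features (`k=20`) is arbitrary, and the classifier is a logistic regression. The only difference between the two runs is where the feature selection happens.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical data: 100 samples, 2000 noise "taxa", balanced random labels.
# No feature is truly associated with the label.
rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000
X = rng.normal(size=(n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky approach: select features on the ENTIRE dataset, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv, scoring="roc_auc"
).mean()

# Proper approach: selection is a pipeline step, so it is refit on the
# training portion of every fold and never sees the held-out samples.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_auc = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc").mean()

print(f"Selection on full data (leaky): AUROC = {leaky_auc:.2f}")
print(f"Selection inside each fold:     AUROC = {honest_auc:.2f}")
```

Because the labels are unrelated to the features, an honest estimate should hover around chance (an AUROC near 0.5). The leaky run tends to look far better than that, even though there is nothing to learn.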
One study that predicted colorectal cancer from the oral microbiota used the least absolute shrinkage and selection operator (LASSO) as a feature selection step before training the predictive random forest model (1). In this study, feature selection was conducted properly, within the training fold. Another study employed two layers of feature selection when predicting colorectal cancer (2). Initially, low-abundance features were removed using a 10-fold cross-validation loop that maximized the AUROC of the models. Subsequently, a recursive feature elimination loop was used to eliminate the least important features based on random forest feature importance (for more on recursive feature elimination, see references 3 and 4). However, it appears that the researchers ran the feature selection procedure on the entire dataset before passing the data to the predictive models, which is the leakage scenario described above.
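Both selection strategies mentioned above can be kept inside the training folds by wrapping them in a scikit-learn pipeline. The sketch below is not the code used in either study. The synthetic dataset, the L1-penalized logistic regression standing in for LASSO, the regularization strength `C=0.5`, the 20 features kept by recursive feature elimination, and the forest sizes are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an abundance table: 100 samples, 200 features,
# of which 10 are informative.
X, y = make_classification(
    n_samples=100, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LASSO-style selection: keep features with non-zero L1 coefficients,
# then train a random forest on the survivors.
lasso_rf = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# Recursive feature elimination driven by random forest importances:
# each round drops the least important features (a step of 20% of the
# original feature count) until 20 remain.
rfe_rf = make_pipeline(
    RFE(
        RandomForestClassifier(n_estimators=100, random_state=0),
        n_features_to_select=20,
        step=0.2,
    ),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# cross_val_score refits the whole pipeline, selection included, on the
# training portion of each fold, so the held-out fold stays untouched.
results = {}
for name, model in [("LASSO + RF", lasso_rf), ("RFE + RF", rfe_rf)]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: AUROC = {scores.mean():.2f} +/- {scores.std():.2f}")
```

The design point is that the selector is just another pipeline step. Whatever cross-validation or tuning is wrapped around the pipeline then applies to the selection as well, so no extra bookkeeping is needed to keep the test fold out of it.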
Thank you for taking the time to read this week’s post on microbiome bioinformatics.
References
- Flemer, B., Warren, R. D., Barrett, M. P., Cisek, K., Das, A., Jeffery, I. B., Hurley, E., O’Riordain, M., Shanahan, F., & O’Toole, P. W. (2018). The oral microbiota in colorectal cancer is distinctive and predictive. Gut, 67(8), 1454–1463. https://doi.org/10.1136/gutjnl-2017-314814
- Yachida, S., Mizutani, S., Shiroma, H., Shiba, S., Nakajima, T., Sakamoto, T., Watanabe, H., Masuda, K., Nishimoto, Y., Kubo, M., Hosoda, F., Rokutan, H., Matsumoto, M., Takamaru, H., Yamada, M., Matsuda, T., Iwasaki, M., Yamaji, T., Yachida, T., Soga, T., … Yamada, T. (2019). Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nature Medicine, 25(6), 968–976. https://doi.org/10.1038/s41591-019-0458-7
- https://machinelearningmastery.com/rfe-feature-selection-in-python/
- https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py