Regaining control of false findings in feature selection, classification, and prediction on neuroimaging and genomics data
The technological advances of past decades have led to the accumulation of large amounts of genomic and neuroimaging data, enabling novel strategies in precision medicine. These largely rely on machine learning algorithms and modern statistical methods for big biological datasets, which are data-driven rather than hypothesis-driven. These methods often lack guarantees on the validity of the research findings. Because it can be a matter of life and death, when computational methods are deployed in clinical practice in medicine, establishing guarantees on the validity of the results is essential for the advancement of precision medicine. This thesis proposes several novel sparse regression and sparse canonical correlation analysis techniques, which by design include guarantees on the false discovery rate in variable selection. Variable selection on biomedical data is essential for many areas of healthcare, including precision medicine, population stratification, drug development, and predictive modeling of disease phenotypes. Predictive machine learning models can directly affect the patient when used to aid diagnosis, and therefore they need to be thoroughly evaluated before deployment. We present a novel approach to validly reuse the test data for performance evaluation of predictive models. The proposed methods are validated in the application on large genomic and neuroimaging datasets, where they confirm results from previous studies and also lead to new biological insights. In addition, this work puts a focus on making the proposed methods widely available to the scientific community though the release of free and open-source scientific software.