High-dimensional statistical data integration
Description
Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A representative model for the integrative analysis of multiple data types decomposes each data matrix into a low-rank common-source matrix generated by latent factors shared across all data types, a low-rank distinctive-source matrix specific to that data type, and an additive noise matrix. We propose a novel decomposition method, called decomposition-based generalized canonical correlation analysis, which appropriately defines these matrices by imposing a desirable orthogonality constraint on the distinctive latent factors, with the aim of sufficiently capturing the common latent factors. To further delineate the common and distinctive patterns between two data types, we propose a second decomposition method, called common and distinctive pattern analysis, which takes into account the common and distinctive information between the coefficient matrices of the common latent factors. We develop consistent estimation approaches for both proposed decompositions under high-dimensional settings and demonstrate their finite-sample performance via extensive simulations. We illustrate the advantages of the proposed methods over state-of-the-art methods using real-world data examples from The Cancer Genome Atlas and the Human Connectome Project.
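As a rough sketch of the three-part model described above, the decomposition can be written as follows. All symbols are assumptions introduced here for illustration and are not taken from the original text: K is the number of data types, n the number of objects, p_k the number of variables in the k-th data type, F the matrix of latent factors shared across data types, G_k the matrix of latent factors distinctive to the k-th data type, and B_k and A_k the corresponding coefficient matrices. The exact form of the orthogonality constraint on the distinctive latent factors is not specified in this description, so it is not written out.

```latex
% Illustrative notation only (assumed, not the authors' own):
% X_k : p_k x n observed data matrix for the k-th data type
% C_k : low-rank common-source matrix
% D_k : low-rank distinctive-source matrix
% E_k : additive noise matrix
\begin{align*}
  X_k &= C_k + D_k + E_k, \qquad k = 1, \dots, K, \\
  C_k &= B_k F,   \qquad F \in \mathbb{R}^{r_0 \times n}
        \ \text{shared by all } K \text{ data types}, \\
  D_k &= A_k G_k, \qquad G_k \in \mathbb{R}^{r_k \times n}
        \ \text{specific to the } k\text{-th data type}, \\
  \operatorname{rank}(C_k) &\le r_0 \ll \min(p_k, n), \qquad
  \operatorname{rank}(D_k) \le r_k \ll \min(p_k, n).
\end{align*}
```

In this notation, the second method's comparison of "the coefficient matrices of the common latent factors" between two data types would correspond to comparing B_1 and B_2.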