Statistical models for the analysis of whole genome bisulfite sequencing datasets
Description
DNA methylation is an epigenetic mechanism for altering DNA-protein interactions that plays an important role in regulation of the genome. Whole genome bisulfite sequencing (WGBS) provides high-resolution methylation profiles and enables the detection of differential methylation (DM) associated with numerous factors, including disease, age, and environmental exposures. Existing algorithms for DM analysis do not account for several properties of WGBS data, including correlation between methylated (M) and unmethylated (U) counts and spatial variation induced by the sequencing process. We introduce a new algorithm to detect differentially methylated CpG sites (DMCs) and differentially methylated regions (DMRs) from WGBS data in a collection of samples representing two distinct groups. A spatial Poisson regression model with random effects (denoted SPORE) is first fit to account for common sources of variability in the methylation profiles, and the residuals from SPORE are then analyzed using a hidden Markov model (HMM) to identify clusters of sites that differ between the two groups. Additional processing identifies likely DMRs and improves specificity. Applying this approach to compare methylation profiles in two highly disparate family groups, we demonstrate that our results compare favorably to those of other popular algorithms.
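To illustrate the second stage only, the following is a minimal sketch of how a two-state HMM could segment per-site residuals into candidate regions. It is not the SPORE implementation. The Gaussian emission model, the parameter values (dm_mean, sd, stay_prob, null_prior), the minimum region size, and the function names viterbi_two_state and candidate_regions are all assumptions made for illustration. The input is assumed to be one between-group residual difference per CpG site, ordered by genomic position.

```python
import math


def _norm_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def _emission_logprobs(x, dm_mean, sd):
    """Assumed emissions: state 0 ~ N(0, sd); state 1 ~ equal mix of N(+dm_mean, sd) and N(-dm_mean, sd)."""
    null = _norm_logpdf(x, 0.0, sd)
    a = _norm_logpdf(x, dm_mean, sd)
    b = _norm_logpdf(x, -dm_mean, sd)
    hi = max(a, b)
    # log(0.5 * exp(a) + 0.5 * exp(b)), computed stably
    diff = math.log(0.5) + hi + math.log(math.exp(a - hi) + math.exp(b - hi))
    return [null, diff]


def viterbi_two_state(residuals, dm_mean=2.0, sd=1.0, stay_prob=0.95, null_prior=0.9):
    """Return the most likely state path (0 = no difference, 1 = differential)."""
    if not residuals:
        return []
    log_stay, log_switch = math.log(stay_prob), math.log(1.0 - stay_prob)
    log_trans = [[log_stay, log_switch], [log_switch, log_stay]]
    log_init = [math.log(null_prior), math.log(1.0 - null_prior)]

    emit = _emission_logprobs(residuals[0], dm_mean, sd)
    score = [log_init[s] + emit[s] for s in (0, 1)]
    backpointers = []
    for x in residuals[1:]:
        emit = _emission_logprobs(x, dm_mean, sd)
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[p] + log_trans[p][s] for p in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + emit[s])
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def candidate_regions(path, min_sites=3):
    """Collapse runs of state 1 into (first_index, last_index) pairs; min_sites is a hypothetical filter."""
    regions, start = [], None
    for i, s in enumerate(list(path) + [0]):
        if s == 1 and start is None:
            start = i
        elif s != 1 and start is not None:
            if i - start >= min_sites:
                regions.append((start, i - 1))
            start = None
    return regions


# Toy residuals (made up for illustration, not real WGBS data).
toy = [0.1, -0.2, 0.0, 3.1, 2.9, 3.3, 3.0, 2.8, 0.1, -0.1, 0.2]
print(candidate_regions(viterbi_two_state(toy)))
```

The minimum-size filter in candidate_regions stands in, very loosely, for the kind of post-processing that could improve specificity. The additional processing used in this work is not specified here and may differ.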