Long-read sequencing methods for studying genome organization, evolution, and function

Abstract: Reference genome assemblies have historically excluded repetitive DNA sequences found in and near centromeres, telomeres, and ribosomal DNA regions, limiting the ability to study their organization, evolution, and function using modern genomic and epigenomic tools. Thanks to recent advances in long-read sequencing and assembly technologies, the Telomere-to-Telomere Consortium produced the first truly complete assembly of a human genome, including formerly missing repetitive regions that make up ~8% of the genome. To make sense of these newly assembled regions, we created and applied tailored sequence analysis tools to reveal the organization and evolutionary relationships of tandemly repeated DNA sequences, with a focus on centromeric regions. We also developed DiMeLo-seq, a method that uses nanopore sequencing to map protein-DNA interactions and endogenous DNA methylation marks at high resolution on long, single molecules of DNA, whose sequences can be mapped reliably to repetitive regions. We applied DiMeLo-seq to measure the density of Centromere Protein A (CENP-A) across human centromeres, revealing strong associations between low CpG methylation, high CENP-A density, and the very recent expansion of underlying tandem repeats, raising important questions about the molecular and evolutionary mechanisms responsible for these associations. Ongoing work continues to extend the DiMeLo-seq method and to leverage its many advantages for studying chromatin organization at the single-molecule level. My lab at Stanford will continue to develop new experimental and analytical approaches for understanding fundamental chromatin biology in the most challenging regions of the human genome.