New Big Data Tools for Single-Cell Omics

Recent progress in high-throughput single-cell methods allows researchers to study the types and states of enormous numbers of cells. Further advances and focused international initiatives, such as the Human Cell Atlas, will likely push the number of cells that can be analyzed to hundreds of millions and beyond. Deriving biological insights from such massive datasets requires new tools for big data visualization and exploration. One major challenge is understanding the molecular programs that guide differentiation during development. I will describe two projects that tackle different challenges within single-cell data analysis: Waddington-OT and scSVA. My colleagues and I applied Waddington-OT to 315,000 single-cell RNA-Seq profiles to reconstruct the landscape of mouse embryonic fibroblasts (MEFs) reprogrammed to induced pluripotent stem cells (iPSCs). The analysis predicts transcription factors and paracrine signals that affect cell fates, and experiments validate that the transcription factor Obox6 and the cytokine GDF9 enhance reprogramming efficiency. scSVA (single-cell Scalable Visualization and Analytics) draws on advances from diverse data-heavy fields, especially astronomy, to scale most of its capabilities to a billion cells with real-time interactivity. To reduce memory usage, scSVA supports efficient retrieval of cell features from massive expression matrices stored on disk. To facilitate reproducible research, scSVA supports interactive analytics in the cloud with containerized tools. Thus, scSVA should enable users to interact with large datasets and complex analytics to yield novel insights and discoveries.
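
As a rough illustration of retrieving a single feature from an expression matrix kept on disk, the Python sketch below memory-maps a file and reads only the bytes belonging to one gene. It is not scSVA's implementation or API. The raw float32 file layout (one gene per row), the toy data, and the helper names `write_matrix` and `read_feature` are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32 value (assumed storage type)


def write_matrix(path, rows):
    """Write a dense genes-by-cells matrix as raw little-endian float32.

    Hypothetical layout for this sketch: one gene per row, rows stored
    back to back, no header.
    """
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}f", *row))


def read_feature(path, gene_index, n_cells):
    """Return one gene's expression across all cells.

    Only the bytes of the requested row are copied out of the mapping,
    so memory use depends on the number of cells, not the matrix size.
    """
    row_bytes = n_cells * FLOAT_SIZE
    offset = gene_index * row_bytes
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[offset:offset + row_bytes]
    return list(struct.unpack(f"<{n_cells}f", chunk))


# Toy data: 3 genes measured in 4 cells.
matrix = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
]

path = os.path.join(tempfile.mkdtemp(), "expression.bin")
write_matrix(path, matrix)

gene1 = read_feature(path, gene_index=1, n_cells=4)
print(gene1)
```

Storing the matrix gene-major makes a per-gene lookup a single contiguous read, which suits coloring every cell in a plot by one gene. A production tool would more likely use a chunked, compressed container format than a raw binary file.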