Over the past few months, the scientific community has responded eagerly to the creation of the National Cancer Institute’s Genomic Data Commons (GDC) – a first-of-its-kind, open-access cancer database that will ultimately help advance Vice President Joe Biden’s Cancer Moonshot Initiative.
The GDC is a step in the right direction and has the potential to help the scientific community advance its understanding of complex diseases such as cancer. Public data sets, including The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project, have already contributed to our evolving understanding of, and approach to, disease research. For example, according to the National Cancer Institute and National Human Genome Research Institute, the publicly available TCGA data set includes 2.5 petabytes of data from over 11,000 patients and has already contributed to more than a thousand studies across 33 types of cancer. And though most agree that greater data sharing will benefit cancer researchers, the details of how best to support such a monumental database are less clear and present a number of interesting challenges. Through my experience working with customers and their many research partners, I know that developing the infrastructure needed to integrate data of different types from varying sources will be the cornerstone of success for this unique database.