Beyond the Moon
The launch of a new data sharing initiative in cancer genomics has been welcomed by doctors and data scientists alike. To make the most of this unique opportunity, we can’t lose sight of the practicalities.
Jens Hoefkens
Over the past few months, the scientific community has responded eagerly to the creation of the National Cancer Institute’s Genomic Data Commons (GDC) – a first-of-its-kind, open-access cancer database that will ultimately help advance Vice President Joe Biden’s Cancer Moonshot Initiative.
The GDC is a step in the right direction and has the potential to help the scientific community advance its understanding of complex diseases, such as cancer. Public datasets, including The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project, have already contributed to our evolving understanding of, and approach to, disease research. For example, according to the National Cancer Institute and National Human Genome Research Institute, the publicly available TCGA dataset includes 2.5 petabytes of data from over 11,000 patients, and has already contributed to more than a thousand cancer studies covering 33 types of cancer. And though most agree that greater data sharing will benefit cancer researchers, the details of how best to support such a monumental database are less clear and present a number of interesting challenges. Through my experience working with customers and their multitude of research partners, I know that developing the necessary infrastructure to support the integration of data of different types, from varying sources, will be the cornerstone of success for this unique database.
As we’ve seen in other examples of translational research and pharmaceutical R&D, increasingly large datasets from diverse high-content methodologies, such as genomics, are typically stored in silos, which makes access and searching more difficult (or impossible). Here are some of the most common challenges we’ve seen researchers and scientists encounter when trying to integrate disparate data sources and varieties:
- Availability of data. The willingness and ability of researchers to share their data vary; some organizations may not want to share proprietary information about their genomic trials.
- Consent and legal issues. Publication of data may not be a part of patient consent procedures.
- Scope. Although genomics is an important piece of translational medicine, there are many other profiling technologies not supported by the GDC. A good example comes from PerkinElmer’s Quantitative Pathology team. While PD-L1 expression (genomic in nature) is an important biomarker for cancer immunotherapies, studies have shown that the spatial distribution of immune system cells around the tumor can also be a predictor of response to treatment. The digital pathology data required for this kind of analysis are not currently in scope for the GDC.
- Access control. The GDC is designed to be an open platform and has little focus on restrictions. Though an open-access strategy makes sense for sharing public data, access controls are an important and complex part of a commercial solution dealing with clinical data.
Researchers need – and want – to be able to easily aggregate internal and external data, while maintaining their focus on the science. Complementary systems offered by experienced and specialized companies can help mitigate the challenges and advance collaborative efforts. Ultimately, the data that need to be integrated fall into three categories: public data, in-house data that could be public, and in-house data that cannot be public. The GDC gives companies the tools to make the second kind of data public. However, it’s the integration tools that will allow companies to merge the third kind, their proprietary data (e.g., data from ongoing clinical trials or from patients who have not consented to making their data publicly available), with everything else. These integration solutions can contribute to greater insights and faster conclusions about potential treatments. With self-service access to a wide variety of data, researchers can more efficiently identify and manage biomarkers, which could help to streamline the development of drugs tailored to unique health needs.
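To make the three categories concrete, here is a minimal Python sketch of how an integration tool might label datasets before deciding what can be released. Everything in it is an assumption made for illustration: the `Dataset` fields, the `categorize` rule, and the example dataset names are hypothetical and do not reflect how the GDC or any particular product actually works.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PUBLIC = "public"
    IN_HOUSE_SHAREABLE = "in-house, could be public"
    IN_HOUSE_RESTRICTED = "in-house, cannot be public"


@dataclass(frozen=True)
class Dataset:
    # Hypothetical metadata; a real system would track far more detail.
    name: str
    already_public: bool
    consent_covers_publication: bool
    proprietary: bool


def categorize(ds: Dataset) -> Category:
    """Assign one of the three categories using a deliberately simple rule."""
    if ds.already_public:
        return Category.PUBLIC
    # In-house data is only a candidate for release when patients consented
    # to publication and the owner does not treat it as proprietary.
    if ds.consent_covers_publication and not ds.proprietary:
        return Category.IN_HOUSE_SHAREABLE
    return Category.IN_HOUSE_RESTRICTED


def releasable(datasets: list[Dataset]) -> list[str]:
    """Names of in-house datasets that could be made public."""
    return [d.name for d in datasets
            if categorize(d) is Category.IN_HOUSE_SHAREABLE]


# Made-up examples, one per category.
datasets = [
    Dataset("public_reference_cohort", True, True, False),
    Dataset("completed_study", False, True, False),
    Dataset("ongoing_trial", False, False, True),
]

for d in datasets:
    print(f"{d.name}: {categorize(d).value}")
```

In this sketch all three datasets remain available for internal analysis; the category only governs what may leave the organization, which is where access controls come in.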
Data sharing in itself is not enough to accelerate cures – researchers also need the appropriate tools to interpret, visualize, and analyze data. Ultimately, ensuring the success of the GDC may not only lead us closer to a cure for cancer, but also transform the way we approach translational research for a wide range of additional diseases.