“Starting with TCGA, our goal is to make large data sets available to the average researcher who would not otherwise be able to access this information,” said lead author Rebecca Jacobson, M.D., M.S., professor of biomedical informatics at Pitt’s School of Medicine and chief information officer of Pitt’s Medicine. “There’s a growing understanding that further advances in health care are going to require a previously unseen level of data-sharing, which will require new tools. That’s particularly true in cancer research, as recognized by the major focus on data-sharing in Vice President Joseph Biden’s recently announced Cancer Moonshot.”
“This work is about enabling and speeding up science,” said Adrian Lee, Ph.D., director of IPM and of UPCI’s Women’s Cancer Research Center, and a co-author on the new paper. “Resources such as this will be key in our move to precision cancer genomic medicine.”
Fundamentally, all cancers are caused by an overgrowth of cells due to an error in DNA. Examining a cancer’s complete set of DNA, or genome, can provide insights into many aspects of tumor biology. The goal of TCGA, a collaborative effort of the National Cancer Institute and the National Human Genome Research Institute, is to collect and share genomic data from cancers with poor prognoses and the greatest impacts on public health. To date, the project has profiled 33 different cancers from more than 11,000 patients, and the resulting data has been used in more than 1,000 cancer studies.
“These very large data sets are incredibly hard to work with because they are enormous, not only in terms of the amount of digital storage space they need, but also in terms of the complexity of software and computational processing power that they require,” Dr. Jacobson said. “Right now, our institutions are choking on data.”
The new software continuously downloads, processes and manages the TCGA data, allowing researchers to take the tools that they need and apply them to making cancer discoveries.
The team then put the new software to work, creating an information technology framework called the Pittsburgh Genome Resource Repository (PGRR) to allow approved Pitt researchers to use the TCGA data much more effectively. While initially designed for TCGA data, the new software can also be used with other large data sets, and is already a key part of several other big data projects PGRR supports, such as the National Institutes of Health’s Big Data to Knowledge initiative and Pennsylvania’s Commonwealth Universal Research Enhancement.
The hope is that the benefits of TCGA Expedition will extend well beyond Pittsburgh.
“The fact that we made our software open source and freely available demonstrates our commitment to taking the advances in using big data sets and data-sharing that we make here and helping other institutions make their own advances,” Dr. Jacobson said.
Additional collaborators on the project included Uma Chandran, Ph.D., M.S.I.S., Olga Medvedeva, M.S., M. Michael Barmada, Ph.D., Anish Chakka, M.S., Soumya Luthra, M.S., Antonio Ferreira, Ph.D., Kim Wong, Ph.D., Jeremy Berg, Ph.D., and Annerose Berndt, Ph.D., D.V.M., all of Pitt; Philip Blood, Ph.D., Zhihui Zhang, Ph.D., Robert Budden, B.S., and J. Ray Scott, B.A., of Carnegie Mellon University