The Cancer Data Science Initiatives team builds interdisciplinary collaborations to develop innovative approaches in data management and advanced scientific computing that will improve treatment for cancer patients. 

We work collaboratively with multidisciplinary teams and organizations across the research community to accelerate cancer and biomedical research.  

With expertise in artificial intelligence, data science, data management and scientific computing, we provide leadership and bring diverse groups of experts together to develop cutting-edge approaches for complex research challenges. This includes the development of computational infrastructure to enable scientific computing at scale, and deep-learning technologies for cancer research. 

We develop innovative data-management resources that streamline the transformation of data into FAIR data resources to foster the development of new data science innovations.

We also built a data and model clearinghouse that enables the research community to access newly developed software, computational and AI models, and key datasets. 

We host interdisciplinary workshops to foster broad adoption of cutting-edge AI and data science analytic tools and capabilities developed by the National Cancer Institute, Frederick National Laboratory and the NCI-DOE Collaboration.  


Participation in the ATOM Consortium 

  • Co-founded the Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium. 

  • Collaborate to transform drug discovery from a slow and high-failure process into a rapid, integrated and patient-centric model. 

  • Integrate high-performance computing, diverse biological data, and emerging biotechnologies into drug discovery and optimization. 

  • Develop new platforms and innovations:

    • AMPL software to generate machine learning models that can predict key safety and pharmacokinetic-relevant parameters  

    • Generative molecular design to determine design criteria that consider pharmacology, safety, efficacy, and developability 

    • An active-learning design platform that enables researchers to selectively incorporate results from mechanistic simulation and human-relevant experimentation to generate and optimize new drug candidates 

High-performance computing and AI support services 

  • Provide consultation and development support to accelerate applications on high-performance computing platforms including use of GPUs. 

  • Support and enable scalable workflows for National Cancer Institute environments. 

  • Deliver and enable use of AI and machine learning platforms including CANDLE, AMPL, and general environments such as TensorFlow. 

Data management services

  • Created the Data Management Environment, which provides a flexible, extensible, and generic interface with emerging object data storage capabilities to support the rapidly expanding volume of laboratory data. 

    • Offers centralized data management that safeguards high-value datasets and enables enterprise data assets to become findable, accessible, interoperable, and reusable (FAIR).  

    • Data Management Environment is now the core element of the NIH Integrated Data Analysis Platform.   

  • Expanded the National Cancer Institute Data Management Services to support over 650 TB of scientific data from nine user groups across the institute. 

  • Developed innovative features and capabilities for the powerful data analysis platform without disruption to system availability.  

  • Implemented seamless integration with multiple data storage systems (Isilon storage, Cleversafe and Cloudian object storage) and support for commercial cloud services including AWS and Google. 

  • Provide advanced capabilities: 

    • Enhanced metadata management features. 

    • Additional data protection protocols to support protected health information. 

    • Special purpose interfaces for scientists and administrators alike. 

Hosting AI and data science workshops 

  • Create, organize, and participate in workshops and tutorials. 

  • Support the NCI Data Science Learning Exchange events. 

  • Co-host the annual Computational Approaches for Cancer Workshop, held at the International Supercomputing Conference. 

    • Provide leadership in program development, speaker recruitment, and community building. 

    • Co-founded the workshop in 2015 with the Icahn School of Medicine at Mount Sinai. 

  • Support Envisioning Computational Innovations for Cancer Challenges Community (ECICC) events and workshops, which have seen participation from scientists and clinicians from over 200 international organizations in academia, government, cancer institutes and research hospitals, and industry. 

    • Provide leadership in community building and engagement in this multidisciplinary community of researchers dedicated to accelerating predictive oncology. 

  • Participate in National Cancer Institute and Department of Energy Collaborations that help seed new partnerships through interactive workshops such as the 2020 Virtual Ideas Lab: Toward Building a Cancer Patient "Digital Twin" and Accelerating Precision Radiation Oncology through Advanced Computing and Artificial Intelligence.


ATOM Modeling Pipeline (AMPL)

  • An open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery. 

  • Extends the functionality of DeepChem and supports an array of machine learning and molecular featurization tools. 

  • An end-to-end, data-driven modeling pipeline to generate machine learning models that can predict key safety and pharmacokinetic-relevant parameters. 

  • Benchmarked on a large collection of pharmaceutical datasets covering a wide range of parameters.  

  • Published in the Journal of Chemical Information and Modeling.

Model and Data Clearinghouse (MoDaC)

  • Provides public access to models and data developed through the ongoing NCI-DOE collaboration.

  • Enables storage and sharing of large, annotated data sets. 

  • Downloads can be performed asynchronously to a Globus endpoint or an AWS S3 bucket, or synchronously to the user’s computer.

  • See the user guide for more information. 

Cancer Distributed Learning Environment (CANDLE)

  • An open-source deep learning software platform that brings AI acceleration to multiple cancer research areas:

    • DOE Exascale Computing Project.  

    • Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) efforts in tumor response, RAS-membrane biology, and cancer surveillance. 

    • Extended applications to multiple other areas including image analysis. 

  • Scalable: runs locally and scales efficiently on the world’s most powerful supercomputers, including the NIH Biowulf system.

  • Supports hyperparameter optimization (HPO) of machine learning and deep learning models using either grid or Bayesian search.

  • View a workshop presentation to learn more.
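
For readers unfamiliar with the grid search option listed above, the idea can be illustrated with a minimal, generic sketch. This is not CANDLE's interface: the function names, the search space, and the toy loss function are all hypothetical and chosen only for illustration. In practice the objective would train a model and return its validation loss.

```python
from itertools import product


def grid_search(objective, search_space):
    """Evaluate every combination in search_space; return the lowest-scoring one."""
    names = list(search_space)
    best_params, best_score = None, float("inf")
    for values in product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_loss(params):
    # Hypothetical stand-in for "train a model, then measure validation loss".
    lr_term = (params["learning_rate"] - 0.01) ** 2
    batch_term = (params["batch_size"] - 64) ** 2 * 1e-6
    return lr_term + batch_term


# Illustrative search space; the values are assumptions, not recommendations.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

best_params, best_score = grid_search(toy_validation_loss, search_space)
print(best_params, best_score)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter. Bayesian search instead uses earlier results to choose which configuration to try next, which is generally why it is preferred when each training run is expensive.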