Abstract

Bioinformatics digital data are growing exponentially. They need to be stored, organized, retrieved, visualized, annotated, analyzed, integrated with experimental metadata, integrated with various other data types, and archived.

These steps require a scientific team that consists of bench scientists, data engineers, bioinformaticians, data scientists, information technology staff, and security personnel. Collaborative, institution-wide solutions provided by enterprise data science platforms make it easier for bench scientists to find data and use scientific applications, remove the reliance on a single team member, and facilitate the reproducibility of analyses.

In addition, there are several regulatory and statutory frameworks to which platforms must adhere as well as tools and environments that users prefer. There is also a large ecosystem to address these needs, making it challenging to evaluate new entrants' offerings against existing platforms. 

In this article, we describe essential components for enterprise data science platforms that bridge the gap between data scientists and bench scientists. These include finding data, visualizing data, sharing tools and data analysis pipelines between collaborators, reproducing results, and scaling up compute. 

These components would integrate the data, software, and computational resources in one environment that would increase the number of analyses being performed by bench scientists and shorten the time to obtain results and make scientific discoveries.

Introduction

Enterprise data science platforms can greatly help in streamlining and accomplishing many scientific computing, data science, and machine learning tasks to extract and interpret information contained in primary data. These include hypothesis generation, data and patient stratification, cohort discovery, and drug development.

From a logistical standpoint, the analysis of scientific data in the biomedical field is increasingly becoming a team effort: analyses must be documented, reproduced, and shared with collaborators over a period of time that can span several years. Typically, one or more bench scientists perform experiments and generate the data. A data scientist, such as a bioinformatician, image analysis specialist, or biophysicist, designs analysis pipelines using open-source scientific programming languages, such as Python or R. Given the need to process increasingly large biological datasets, these pipelines are designed to take advantage of high-performance computing (HPC) clusters or cloud-based computing resources. Bench scientists, on the other hand, are generally trained to perform data analysis using interactive graphical user interfaces (GUIs), which abstract the underlying scientific code and expose only a minimal number of parameters for the workflow to be executed (e.g., input data selection, thresholds, location for output data) [1].

To build a data science workflow, scientists must use a variety of scientific computational libraries for data management, visualization, and analysis. Designing, distributing, and maintaining reproducible analysis workflows can be problematic, as code libraries are often only available on certain platforms or operating systems or with certain versions that are not reproducible on different platforms (e.g., different versions of R or Python libraries with their corresponding dependencies). Meanwhile, browser-based methods that require zero setup are often used to publish pipelines for the community [1]. This enables many users to use the same sets of packages and hardware while removing many IT, hardware, and software obstacles to reproducing high-quality research.

Figure 1 shows a high-level diagram of stakeholders. Bioinformaticians and bench scientists communicate requirements and specifications for the analysis. Once the requirements are defined, the bioinformaticians/data scientists develop tools and algorithms on their preferred development environments. The resulting software is then pushed to a code repository with the required information to reproduce the software environment. The data science platform pulls the developed code, creates the required environment, and shares easy-to-access web applications that bench scientists can use with their devices without having to install anything. Bench scientists authenticate with the platform using a single sign-on login and start interacting with the shared application. To execute the application, the platform should be able to search metadata and access data shared with the bench scientists by using Findable, Accessible, Interoperable, and Reusable (FAIR) principles [2]. The bench scientists then set the pipeline’s parameters and scale the computation on an HPC cluster, cloud infrastructure, or hybrid infrastructure that saturates the on-premises compute before it bursts to the cloud. Finally, the results are ingested back on the platform for further visualization and interpretation.

A figure showing enterprise data science platforms and their relationship between biologists, data repositories, high performance and/or cloud compute, algorithm repositories and data scientists.
Figure 1. Enabling technologies for enterprise scientific computing.

To realize the process in Figure 1, platforms for sharing data science workflows should enable the following key functionalities:

  1. Data and metadata access and storage
  2. Web-based interfaces for visualization, parameter selection, and customization for algorithms
  3. Scale-up of the computational jobs on-premises, on the cloud, and with hybrid approaches
  4. Support for computational workflows/pipelines and data provenance
  5. Communication/sharing of pipelines between collaborators and access control for pipelines in development/production
  6. Reproducibility of the software environment and the corresponding analysis
  7. Support for clinical regulations
  8. Support of development/operation best practices, such as version control, unit testing, and Integrated Development Environment
  9. Authentication of users and access authorization for data and pipelines

In the remainder of this article, we explain these requirements and provide examples and potential solutions for implementation.

1. Data and metadata access and storage

Scientific data can be ingested from multiple sources, including direct data upload, connection to data application programming interfaces (APIs), cloud buckets, and databases. Examples include the NCI Cancer Research Data Commons (CRDC) and the Model and Data Clearinghouse (MoDaC). Web-based data access requires multiple connections to various archives, and these data need to be handed off from system to system. However, data are often large and can be expensive to store and transfer; consequently, it is better to share metadata about the data before launching transfer requests. Platforms for data science would allow connections to multiple metadata APIs that can sometimes scale to thousands or millions of files or attributes.

For researchers to find data of interest, data repositories that implement the FAIR principles would be connected to the data science platforms. A platform would present the data to the bench scientists by using data catalogs that summarize the main metadata published with the data, such as disease type, data modality, number of samples in a project, data versions, and owners. Meanwhile, a single sign-on login should give users seamless access to the data they are authorized to view and use.
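The catalog lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DatasetRecord` fields, the group names, and the sample entries are hypothetical and do not correspond to any particular repository's API. The point is that filtering and size estimation happen on metadata alone, before any data transfer is requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata summary for one dataset; the data themselves stay in the repository."""
    dataset_id: str
    disease_type: str
    modality: str
    n_samples: int
    size_gb: float
    allowed_groups: frozenset  # hypothetical: groups authorized to view the dataset


def search_catalog(catalog, user_groups, **filters):
    """Return records the user may see that match every metadata filter."""
    hits = []
    for rec in catalog:
        if not (rec.allowed_groups & user_groups):
            continue  # user is not authorized to view this dataset
        if all(getattr(rec, key) == value for key, value in filters.items()):
            hits.append(rec)
    return hits


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    DatasetRecord("ds-001", "breast cancer", "H&E imaging", 120, 850.0, frozenset({"lab-a"})),
    DatasetRecord("ds-002", "breast cancer", "RNA-seq", 300, 40.0, frozenset({"lab-a", "lab-b"})),
    DatasetRecord("ds-003", "lung cancer", "RNA-seq", 90, 12.5, frozenset({"lab-b"})),
]

hits = search_catalog(CATALOG, frozenset({"lab-b"}), modality="RNA-seq")
total_gb = sum(rec.size_gb for rec in hits)  # estimate transfer size from metadata only
print([rec.dataset_id for rec in hits], total_gb)
```

In practice the group membership would come from the enterprise identity provider rather than being passed in by hand.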

Databases are another source of data. Database connectivity can be essential for applications that have a persistent state or have multiple users performing different roles, such as a system where some users enter or generate data while others use and review the data to produce publications. This kind of functionality is typical of a structured database, which ensures that there are no data collisions when two or more users, at least one of whom writes the data, access the data. Data science platforms should be able either to connect to databases or to provide database-like functionalities to support various workflows.

It is usually recommended to isolate the technology used to store the data from the technology used to store the metadata. This allows data to be migrated between different storage solutions based on the data lifetime policy.
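One way to picture this separation is a metadata entry that holds only a pointer to where the bytes live. The sketch below is a simplified assumption, not a description of any specific storage product: the `FileEntry` fields, the `hot://` and `archive://` prefixes, and the 365-day idle threshold are all hypothetical. Migration changes the pointer and leaves the descriptive metadata untouched.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class FileEntry:
    file_id: str
    storage_uri: str     # where the bytes live; the only field migration changes
    last_accessed: date
    disease_type: str    # descriptive metadata, independent of the storage tier


def apply_lifetime_policy(entries, today, max_idle_days=365, archive_prefix="archive://"):
    """Return entries with long-idle files re-pointed to the archive tier."""
    updated = []
    for entry in entries:
        idle_days = (today - entry.last_accessed).days
        if idle_days > max_idle_days and not entry.storage_uri.startswith(archive_prefix):
            # A real platform would copy the bytes first; here only the pointer moves.
            entry = replace(entry, storage_uri=archive_prefix + entry.file_id)
        updated.append(entry)
    return updated


entries = [
    FileEntry("f1", "hot://bucket/f1", date(2020, 1, 1), "breast cancer"),
    FileEntry("f2", "hot://bucket/f2", date(2024, 5, 1), "lung cancer"),
]
migrated = apply_lifetime_policy(entries, today=date(2024, 6, 1))
print([e.storage_uri for e in migrated])
```

Because searches run against the metadata fields, a catalog query for `disease_type` returns the same records before and after migration.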

2. Web-based interfaces for visualization, parameter selection, and customization for algorithms

Visualization of digital biomedical data is an essential part of bioinformatics analysis. A web-based approach provides the same environment across an institute so that researchers and bench scientists do not need to build dedicated environments to reproduce the research.

Different data types require specialized visualization tools. Next-generation sequencing data need a genome browser, high-throughput screening of fluorescent microscopy images needs tools to highlight hundreds or thousands of wells in multiple channels and at different resolutions, whole-slide pathology hematoxylin and eosin (H&E) and multiplex immunofluorescence images need a way to display different parts of the tissue or the biopsy at different resolutions in up to tens of channels, and small-molecule drug analyses need 3D plots. The platform would also need to support other visualization tools, such as those for numerical tabular data and exploratory data analysis techniques. These include dot plots, histograms, box plots, line plots, and 2D and 3D t-SNE plots.

While data-specific visualization tools usually come from open-source tools, data scientists using Python, R, and other programming languages develop many custom-made visualization plots. Supporting many data-specific, open-source, web-based visualizations in one platform would require a significant amount of effort. This has driven many commercial bioinformatics platforms (e.g., Terra, HALO, OMERO) to specialize in specific data types, while other platforms would accommodate as many data types as possible.

Another important function of the web interface is to allow nontechnical bench scientists to tweak the relevant parameters for a given pipeline and observe the change in results with minimal help. This step is usually part of the parameter adjustment phase and runs on a relatively small amount of data. It is normally followed by large-scale computing on all the available data that need to be scaled using more compute resources.
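A small sketch of how a pipeline might declare the parameters it exposes to a web form follows. The `Param` class, the parameter names, and the bounds are hypothetical; a real platform would likely use its own schema format. The idea is that defaults and limits are declared once by the data scientist, so the interface can validate a bench scientist's input before anything runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """One user-adjustable pipeline parameter with a default and allowed range."""
    name: str
    default: float
    minimum: float
    maximum: float


# Hypothetical parameters for an image segmentation pipeline.
SEGMENTATION_PARAMS = [
    Param("threshold", 0.5, 0.0, 1.0),
    Param("min_object_size", 50, 1, 10000),
]


def resolve_params(spec, user_values):
    """Merge user input with defaults and reject unknown or out-of-range values."""
    unknown = set(user_values) - {p.name for p in spec}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for p in spec:
        value = user_values.get(p.name, p.default)
        if not (p.minimum <= value <= p.maximum):
            raise ValueError(f"{p.name}={value} is outside [{p.minimum}, {p.maximum}]")
        resolved[p.name] = value
    return resolved


print(resolve_params(SEGMENTATION_PARAMS, {"threshold": 0.7}))
```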

3. Scale-up of the computational jobs on-premises, on the cloud, and with hybrid approaches

A faster time to solution can be achieved via parallel computing or HPC. To speed up analyses, users must identify two components: parallel jobs and multiple compute nodes. Parallel jobs can be independent tasks in one pipeline, multiple samples to be executed by the same pipeline, independent hyperparameter optimization for machine learning models, or loosely dependent tasks in a given computation. Parallel data science pipelines often need to scale up the computation on a large number of compute nodes or require special hardware for computation. The type of compute nodes would include large memory units, nodes with accelerated computing capabilities (e.g., graphics processing units), nodes with specific input/output capabilities, and nodes with fast communication standards (e.g., InfiniBand). The method of scaling computation on a cluster depends heavily on the amount of computation that would take place on a given node versus the communication that takes place between nodes. Fairly independent jobs can be sent to a scheduler for assignment to a specific node on-premises or on the cloud. Jobs that require coordination would need to make use of a programming workflow to facilitate communication and job scheduling.

A data science platform should coordinate with the on-premises and cloud resources so that users can submit jobs to the compute cluster, track jobs’ status, and ingest results back. Another consideration is allowing fault tolerance and error recovery for jobs that fail due to unexpected hardware or software faults and establishing a means to inform a user of their job’s status.
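The error-recovery idea can be illustrated with a small retry wrapper. This is a sketch under simplifying assumptions: the job is a plain Python callable rather than a submission to a real scheduler, and the status strings and `make_flaky_job` helper are hypothetical. It retries a failed job a bounded number of times and keeps a history that could be shown to the user.

```python
import time


def run_with_retry(job, max_attempts=3, delay_s=0.0):
    """Run job() up to max_attempts times; return (status, result, history)."""
    history = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
        except Exception as exc:  # any failure counts as a retryable fault here
            history.append(f"attempt {attempt}: FAILED ({exc})")
            time.sleep(delay_s)
        else:
            history.append(f"attempt {attempt}: COMPLETED")
            return "COMPLETED", result, history
    return "FAILED", None, history


def make_flaky_job(failures_before_success):
    """Build a stand-in job that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("node unavailable")
        return "results.tsv"

    return job


status, result, history = run_with_retry(make_flaky_job(2))
print(status, result)
for line in history:
    print(line)
```

A production system would usually distinguish transient faults (worth retrying) from deterministic errors in the code or input (not worth retrying); this sketch treats them the same.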

Scalable programming interfaces like Apache Spark provide the engine for large-scale data processing on-premises and on the cloud, where jobs are typically placed by a scheduler or orchestrator (e.g., Slurm, Kubernetes). Other low-level interfaces, like the Message Passing Interface (MPI), should be used in HPC resources for applications that need extensive customization. When on-premises and cloud resources are both available, it is usually advantageous to saturate the on-premises resources before off-loading burst compute to the cloud. Data scientists might be required to refactor applications to expose parallelism and make use of distributed computing resources. In some cases where the computation-to-communication ratio is large, communication can occur via writing and reading partial results to and from the file system. This process is usually described using a workflow management system in an intuitive way.
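The "saturate on-premises first, then burst" policy can be sketched as a simple greedy placement. The function below is an assumption-laden toy, not how Slurm, Kubernetes, or any real scheduler works: jobs are described only by a hypothetical name and core count, and capacity is a single number per site.

```python
def assign_jobs(jobs, on_prem_free_cores, cloud_max_cores):
    """Place each (name, cores) job on-premises if it fits, else on the cloud, else queue it."""
    placement = {}
    for name, cores in jobs:
        if cores <= on_prem_free_cores:
            on_prem_free_cores -= cores
            placement[name] = "on_prem"
        elif cores <= cloud_max_cores:
            cloud_max_cores -= cores  # burst to the cloud only when on-prem is full
            placement[name] = "cloud"
        else:
            placement[name] = "queued"
    return placement


# Hypothetical job list for illustration.
jobs = [("align", 16), ("variant_call", 8), ("train_model", 32), ("report", 4)]
placement = assign_jobs(jobs, on_prem_free_cores=24, cloud_max_cores=32)
print(placement)
```

The cap on cloud cores stands in for a budget limit; real policies would also weigh data transfer cost and queue wait times.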

4. Support for computational workflows/pipelines and data provenance

Bioinformatics analysis pipelines often consist of many dependent steps to digest data in their native format and preprocess them (e.g., normalization, quality control, harmonization across different sources). The pipelines also incorporate multiple consecutive parallel steps of processing. Every step takes a defined input and generates an intermediate output that is used in subsequent steps. Many pipelines run for multiple input samples to generate the corresponding outputs.

There are many open-source workflow managers that data science frameworks can adopt. Those used at the Frederick National Laboratory include Snakemake, Nextflow, and Common Workflow Language. These frameworks help data scientists and bioinformaticians develop and test the pipelines. Platforms might also adopt a proprietary framework as long as the workflow can be executed outside the platform for reproducibility studies. Workflows should allow the developers and the end users to customize the parameters required to reproduce the results at every step in the pipeline.
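To make the notion of dependent steps concrete, here is a minimal sketch of a workflow expressed as a dependency graph and executed in topological order. It uses only the Python standard library (`graphlib`, available from Python 3.9) and is not Snakemake, Nextflow, or Common Workflow Language syntax; the step names and functions are hypothetical placeholders for real tools.

```python
from graphlib import TopologicalSorter


def qc(reads):
    """Quality control: drop invalid (negative) values."""
    return [r for r in reads if r >= 0]


def normalize(values):
    """Scale values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def summarize(values):
    """Report the largest normalized value."""
    return max(values)


# Each step maps to (function, names of upstream steps); "raw" is the pipeline input.
STEPS = {
    "qc": (qc, ["raw"]),
    "normalize": (normalize, ["qc"]),
    "summarize": (summarize, ["normalize"]),
}


def run_workflow(steps, raw_input):
    """Execute steps in dependency order and keep every intermediate output."""
    outputs = {"raw": raw_input}
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        if name == "raw":
            continue  # the input is already available
        func, deps = steps[name]
        outputs[name] = func(*(outputs[d] for d in deps))
    return outputs


outputs = run_workflow(STEPS, [4, -1, 6])
print(outputs)
```

Keeping the intermediate outputs is what lets a user inspect or rerun a single step, which dedicated workflow managers handle with caching and per-step environments.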

Workflows are also expected to provide a source for data provenance, which necessitates capturing key information about the version of the input data, the versions of the tools used, and potentially any important parameters and partial results required to explain the results. This is crucial, especially for pipelines that train and optimize machine learning models, to capture the hyperparameters and settings used to generate a model. For analyses associated with publications, audit logs and reports related to provenance would be invaluable in documenting the results, interpreting them, and sharing them with collaborators.
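A provenance record of the kind described above might look like the following sketch. The field names, the tool names, and the version strings are illustrative assumptions; the only technique shown is hashing the input so that its exact version is captured alongside tool versions and parameters.

```python
import hashlib
import json
import platform


def provenance_record(input_bytes, tool_versions, parameters):
    """Build a record tying a result to its input, tools, and parameters."""
    return {
        # A content hash identifies the exact input version without storing the data.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "tool_versions": dict(sorted(tool_versions.items())),
        "parameters": dict(sorted(parameters.items())),
        "python_version": platform.python_version(),
    }


# Hypothetical input, tools, and hyperparameters for illustration.
record = provenance_record(
    b"sample_id,count\nS1,42\n",
    tool_versions={"aligner": "2.1.0", "caller": "0.9.3"},
    parameters={"learning_rate": 0.01, "epochs": 20},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such a record next to each output makes it possible to tell later whether two results came from the same input and settings.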

5. Communication/sharing of pipelines between collaborators and access control for pipelines in development/production

Collaborative research often requires stakeholders to be able to edit and see changes to documents and programs in real time, or to be able to develop a tool in a single programming environment that, by design, would guarantee that application could be reproduced from development to production within a specific platform. Other key aspects for shareability in highly regulated environments are the control of access to sensitive information and the tracking and auditing of data trails. The ideal platform would also support these.

Shareability can mean several things. While real-time shareability is a nice goal, there are other methods. For example, during application development, the platforms should provide different access control roles for stakeholders who view/read forms versus those who modify/write them or own/administer them. This access can be provided to individuals or groups. Similarly, the platform should be able to ingest code and software specifications from public repositories and allow developers to share their tools with a specific list of users.

Data science platforms that allow the creation of a sharable environment should also allow users to export the code base and the environment’s specifications so that the users can also reproduce the tools on alternative platforms.
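The view/modify/administer roles mentioned earlier in this section can be sketched as ordered roles granted to individuals or groups. The role names, the grant table, and the user and group names below are hypothetical; the sketch only shows that the highest applicable grant decides what a user may do.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1   # may view/read
    EDITOR = 2   # may also modify/write
    OWNER = 3    # may also administer


# Minimum role needed for each action.
REQUIRED = {"read": Role.VIEWER, "write": Role.EDITOR, "administer": Role.OWNER}


def effective_role(user, user_groups, grants):
    """Return the highest role granted to the user directly or via a group."""
    roles = [grants[principal] for principal in {user, *user_groups} if principal in grants]
    return max(roles, default=None)


def is_allowed(user, user_groups, grants, action):
    """Check whether the user's effective role is sufficient for the action."""
    role = effective_role(user, user_groups, grants)
    return role is not None and role >= REQUIRED[action]


# Hypothetical grants on one shared pipeline.
grants = {"alice": Role.OWNER, "imaging-group": Role.VIEWER}

print(is_allowed("bob", {"imaging-group"}, grants, "read"))
print(is_allowed("bob", {"imaging-group"}, grants, "write"))
```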

6. Reproducibility of the software environment and the corresponding analysis

The goal of reproducible research is for different labs to replicate methods with little effort. Developers create and use open-source codebases and tools to develop their applications. Creating a reproducible software environment has been a challenge for programmers and system admins. Hardware platforms, operating systems, and APIs are varied, replete with different features, versions, and behaviors. Recently, a plethora of versions of application programming environments (e.g., Java 7 and Java 8, Python 2.7 and Python 3.x, and R 3 and R 4) has made it difficult to distribute code. If a data science platform is to be robust and use open-source packages developed in each of these languages at the same time, all these versions must be enabled on the same server.

Figure 2 shows different software layers that should be customized and hosted by the data science platform. The first three layers are related to data processing and tools integration that is independent from the data science platform, while the fourth relates to parameter selection and visualization specific to the data science platform. The application-specific layers can be maintained in a repository shared with the public, while the platform-specific layer can be interchanged between platforms. This division enables application migrations between different platforms and reduces the overhead of updating and maintaining scientific applications.

Numbered blocks representing layers. From top to bottom: 4-Interactive Layer (Platform-specific configuration). 3-Scientific applications (code repo). 2-Data science libraries (package configuration). 1-Operating system/containers.
Figure 2. Composition of scientific software layers.

Many tools have been developed to enable reproducibility, including package managers (e.g., conda for Python, renv for R), container technologies (e.g., Docker, Singularity), and virtual environments. The data science platforms might need to isolate applications so that they run in their own contained environments in order to support multiple packages in a secure way. In addition to creating the required environment, the platforms should download data science packages and ingest code bases from public repositories, as shown in Figure 2. In some scenarios, publicly available tools would need to be cleared from a security/regulatory standpoint, which would slow down the development/operation cycle.
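As a rough illustration of what recording an environment involves, the sketch below captures the interpreter version and the installed versions of a list of packages using the Python standard library, then compares two such snapshots. It is not a replacement for conda, renv, or container images; the package names and version strings in the example snapshots are hypothetical.

```python
import sys
from importlib import metadata


def capture_environment(package_names):
    """Record the Python version and installed version of each named package."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    v = sys.version_info
    return {"python": f"{v.major}.{v.minor}.{v.micro}", "packages": versions}


def environment_drift(recorded, current):
    """Return {package: (recorded_version, current_version)} for every mismatch."""
    names = set(recorded["packages"]) | set(current["packages"])
    drift = {}
    for name in sorted(names):
        old = recorded["packages"].get(name)
        new = current["packages"].get(name)
        if old != new:
            drift[name] = (old, new)
    return drift


# Hypothetical snapshots taken on a development and a production machine.
dev = {"python": "3.11.4", "packages": {"numpy": "1.26.0", "pandas": "2.1.0"}}
prod = {"python": "3.11.4", "packages": {"numpy": "1.26.0", "pandas": "2.2.1"}}
print(environment_drift(dev, prod))
print(capture_environment(["pip"])["python"])
```

A drift report like this only flags differences; pinning versions in a lock file or container image is what actually prevents them.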

7. Support for clinical regulations

In a research and development setting for biological data, data science platforms can, as expected, be used in research projects with minimal Protected Health Information and Personally Identifiable Information data, yet some projects could require support for regulatory standards. This includes regulations related to medical devices, the Health Insurance Portability and Accountability Act (HIPAA), electronic systems for clinical trials, and information security plans to protect sensitive data (e.g., FISMA). Here, we list a few of these regulations.

HIPAA

While interacting with Personally Identifiable Information doesn’t always apply to basic research on public datasets, the ability to manage such data is an important benefit for a robust system. Data science platforms that would support clinical trials must comply with HIPAA regulations.

Code of Federal Regulations (CFR) Title 21, Part 11

21 CFR Part 11 contains crucial regulations that govern electronic systems used to support clinical trials. These regulations can be interpreted as industry best practices. It is important to know the degree to which analytical systems support 21 CFR Part 11 and interact with systems that support it.

Software as a Medical Device

The FDA regulates the development of software and machine learning pipelines to predict diagnosis [https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device], including all aspects of software development from prototype design to evaluation, retention of records, and beyond. In the case of adaptive learning machines that may continue to learn and adapt even after the initial approvals, there are prescriptions for data curation and segregation to validate manufacturers’ package inserts and claims.

8. Support of development/operation best practices, such as version control, unit testing, and Integrated Development Environment

Validating and testing developed or imported tools is essential for moving to production. Hence, data science platforms should support intuitive environments for code development, refactoring, modularization, versioning, debugging, and automated testing. Often, this requires the platform to integrate with code development tools (e.g., Git, Visual Studio Code) for multiple data science languages (e.g., Python, R, SAS) to accelerate the development and operation cycle.
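As a small illustration of automated testing, the sketch below pairs a pipeline function with unit tests written using Python's standard `unittest` module. The `normalize_counts` function and its test cases are hypothetical examples; a platform would typically run such tests automatically whenever code is pushed to the repository.

```python
import unittest


def normalize_counts(counts):
    """Scale raw counts so that they sum to one."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize all-zero counts")
    return [c / total for c in counts]


class NormalizeCountsTest(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize_counts([2, 3, 5])), 1.0)

    def test_preserves_proportions(self):
        self.assertEqual(normalize_counts([1, 1]), [0.5, 0.5])

    def test_rejects_all_zero(self):
        with self.assertRaises(ValueError):
            normalize_counts([0, 0])


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeCountsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
print("all tests passed:", result.wasSuccessful())
```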

If a machine learning model is the output of a pipeline, the platform would facilitate moving the model from development to production and track the model’s performance. One way to achieve this is by providing scalable representational state transfer (REST) APIs to call the model in inference. Dashboards for a given model’s statistics could also be developed to communicate the model’s return on investment with the business team.

In addition, modern scientific analysis workflows in biomedicine are increasingly dependent on open-source programming languages, such as R and Python. Both languages take advantage of mature scientific computation frameworks (Bioconductor, Tidyverse, NumPy, pandas, Matplotlib, Scikit-learn, among others) that are becoming standard for processing, visualizing, and modeling scientific data generated from a variety of instruments, such as next-generation DNA/RNA sequencers, microscopes, and mass spectrometers. Furthermore, both R and Python provide literate programming platforms, such as Jupyter Notebooks, R Markdown documents, and Shiny applications.

A data science enterprise platform should enable developers to easily and seamlessly use the mature and state-of-the-art packages developed by the community for scientific computing. Seamless development is achieved by allowing import and export of custom code and libraries with very little refactoring of vendor-specific syntax. This adoption of open-source software promotes computational transparency, reproducibility, and accessibility, which are all essential to ensure the integrity of the scientific process between stakeholders.

9. Authentication of users and access authorization for data and pipelines

All stakeholders need to authenticate with the data science platform. This includes users who generate data, data scientists, bench scientists, and potential internal or external collaborators. At an enterprise level, adopting single sign-on technologies would streamline access to the data, scientific tools, compute resources, and visualization without prompting users to enter their credentials multiple times. Consequently, if the goal is to seamlessly adopt the latest cybersecurity and authentication protocols (e.g., moving from password authentication to two-factor authentication), it is essential for a platform to integrate with the enterprise’s identity providers and authentication mechanisms. Once users are authenticated, the platform should manage the authorization and access control lists for the data, scientific workflows, results, and related items. In a shared environment, these features are essential to guarantee the integrity, confidentiality, and auditing requirements for scientific data and pipelines.
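The split between authentication and authorization can be sketched as follows. The example assumes the user has already been authenticated by the enterprise identity provider, so only the authorization check and the audit trail are shown; the access control list layout, resource names, and user names are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical access control list: resource -> user -> permitted actions.
ACL = {
    "dataset:ds-002": {"alice": {"read", "write"}, "bob": {"read"}},
    "pipeline:segmentation": {"alice": {"read", "execute"}},
}


def authorize(user, resource, action, acl, audit_log):
    """Decide whether an already-authenticated user may act, and record the decision."""
    allowed = action in acl.get(resource, {}).get(user, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "allowed": allowed,
    })
    return allowed


audit_log = []
print(authorize("bob", "dataset:ds-002", "read", ACL, audit_log))
print(authorize("bob", "dataset:ds-002", "write", ACL, audit_log))
print(len(audit_log), "decisions recorded")
```

Denied requests are logged as well as granted ones, since auditing requirements usually cover attempted access, not just successful access.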

Conclusion

In this article, we have highlighted key functionalities, challenges, and requirements relevant to enterprise data science platforms in a bioinformatics setting. A successful platform must provide many essential functionalities that facilitate the software development process for data science. These include creating data and metadata catalogs, reproducing software environments, scaling compute resources, supporting best software development practices, and developing easily configurable pipelines and visualization tools. In addition, to satisfy the requirements for clinical data, the platform should support a mechanism to guarantee the integrity and confidentiality of the data and the software, meet regulatory standards for Personally Identifiable Information and Protected Health Information data, and enable mechanisms for data provenance.

Implementing and customizing these capabilities requires close coordination between all stakeholders, including data scientists, tool developers, bench scientists, biologists, investigators, system admins, regulatory and security teams, infrastructure teams, and business owners. We believe such tools will greatly accelerate research and the development of data science tools to build computational pipelines, analyze patterns, generate insights, and develop new treatments for patients.

References:

[1] Bhawsar PM, Abubakar M, Schmidt MK, Camp NJ, Cessna MH, Duggan MA, García-Closas M, Almeida JS. Browser-based data annotation, active learning, and real-time distribution of artificial intelligence models: from tumor tissue microarrays to COVID-19 radiology. Journal of Pathology Informatics. 2021 Jan 1;12(1):38.

[2] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016 Mar 15;3(1):1-9.