AI and Data Science Standards

Data science pursues a rigorous and systematic process based on the scientific research method, promoting a circular approach that includes:

Understanding of the problem
Data exploration
Data preparation
Algorithm selection and model building
Model performance evaluation
Model deployment

Standards, processing techniques, and standard operating procedures are at the core of data science, and their use in this field is critical. Data science sits at the intersection of laboratory research and data technologies, enhancing knowledge discovery and advancing in silico studies.

Our scientists in the Cancer Data Science Initiatives and Bioinformatics and Computational Science program have curated these standards and make them available to researchers to advance reproducibility in data science.

Bioinformatics

Enterprise data science platforms for scientific computing and machine learning

An overview of required functionalities and components in bioinformatics research.

Machine learning models

A deep learning pipeline for nucleus segmentation

Importance of creating a standard data science pipeline for evaluating machine learning models.

Data science workflow

The data science life cycle can be depicted via the Cross Industry Standard Process for Data Mining (CRISP-DM) and the OSEMN (Obtain data, Scrub data, Explore, Model, iNterpret results) framework. The workflow we elucidate in the following sections loosely aligns with these frameworks and can be summarized by the following steps:

1. Problem definition and understanding

Build an understanding of the question, requirements, and goals of the problem or challenge at hand

A good understanding of the research problem and its context enables selection of an appropriate approach to solving the problem in any discipline, including data science. Start by setting goals and expectations: What do you hope to achieve? What challenges (data-related, computational, infrastructure, resources, etc.) might you encounter? What are the expected benefits?

Next, follow these steps to define the problem:

Frame the problem statement.
Translate the statement into a data science problem, defining it from a perspective that includes data analysis, metrics, and patterns. For example, many data science problems can be categorized into supervised or unsupervised learning.
Gather prior knowledge related to the problem and identify the data required and available.

2. Data preparation

Aggregate, assess, prepare, clean, and finalize the dataset involved

Having a comprehensive understanding of the dataset to be studied is crucial to developing effective models from the data. It also helps confirm that what you want to accomplish is possible.

A common misconception is that the next step in the data science workflow, modeling and development, is the most time-consuming and important. In reality, data preparation often takes the bulk of the time and, when performed effectively, typically makes modeling much easier and more efficient.

Common steps taken at this stage include:

Processing the dataset to address issues such as missing values, corrupt records, data inconsistency, incorrect data types, privacy (anonymization), and formatting.
Exploring the data to develop an understanding of the information included in the data. If possible, identify any patterns, correlations, and characteristics/attributes of the data. Identify which attributes are more important to solving the problem.
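The processing issues listed above can be illustrated with a minimal sketch. It uses only the Python standard library, and the column names, sample values, and the choice of median imputation are assumptions made for illustration; they are not prescribed by this workflow.

```python
import csv
import io
import statistics

# Hypothetical raw records: one missing value, one corrupt entry, one duplicate row.
RAW_CSV = """sample_id,age,tumor_size_mm
S1,54,12.5
S2,,8.0
S3,61,not_recorded
S3,61,not_recorded
S4,47,15.0
"""

def to_float(text):
    """Return the value as a float, or None if it is missing or corrupt."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(raw_csv):
    rows, seen = [], set()
    for record in csv.DictReader(io.StringIO(raw_csv)):
        key = tuple(record.values())
        if key in seen:  # drop exact duplicate records
            continue
        seen.add(key)
        rows.append({
            "sample_id": record["sample_id"],
            "age": to_float(record["age"]),  # fix the data type
            "tumor_size_mm": to_float(record["tumor_size_mm"]),
        })
    # Assumption: fill missing numeric values with the column median.
    for column in ("age", "tumor_size_mm"):
        observed = [r[column] for r in rows if r[column] is not None]
        median = statistics.median(observed)
        for r in rows:
            if r[column] is None:
                r[column] = median
    return rows

cleaned = clean(RAW_CSV)
for row in cleaned:
    print(row)
```

Median imputation is only one option; dropping incomplete records or flagging them for review may be more appropriate depending on the dataset.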

3. Modeling and development

Build models, algorithms, and analysis workflows using the relevant dataset

The next step in the data science life cycle is to develop an algorithm or analysis workflow, or to build a model that is learned from the data. The choice of model is generally heavily informed by the preceding data preparation step.

The purpose of modeling is generally the development of a parameterized mapping between the data and the response set in the form of a function or process that has learned the characteristics of the data. While “modeling” in data science usually refers to machine learning models, it can also include experimental, probabilistic, graph theory, and differential equation models.

The following steps are typically involved at this stage:

Model selection: There are often particular models associated with specific types of problems, e.g., image processing and text-related tasks.
Data splitting: The models and their associated algorithms are often trained on a subset of the data called the training set, and their performance is measured on another subset called the test set. In addition, often a third subset of the data—the validation set—is held out for determining the best set of parameters—called hyperparameters—that define the model itself.
Model training: In this step, the weights associated with the model are fit to the training set; specifically, a scalar loss function is numerically minimized with respect to the weights. Typical loss functions and performance metrics include:
- For classification tasks: accuracy, confusion matrix, logarithmic loss, area under the curve, F-score
- For regression tasks: mean absolute error, (root) mean squared error
- Other metrics: chi-square, confidence interval, predictive power
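As a rough illustration of data splitting and model training, the sketch below shuffles a synthetic dataset into training, validation, and test subsets, then fits the two weights of a straight line by minimizing the mean squared error with gradient descent. The 60/20/20 split, the learning rate, and all function names are assumptions chosen for the example.

```python
import random

def split_dataset(records, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def mse(weights, data):
    """Scalar loss: mean squared error of y = w*x + b on (x, y) pairs."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fit_line(data, learning_rate=0.1, steps=2000):
    """Minimize the MSE with respect to the weights by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic data (assumption): y = 3x + 1 plus a little noise, x in [0, 1).
rng = random.Random(42)
records = []
for _ in range(100):
    x = rng.random()
    records.append((x, 3 * x + 1 + rng.gauss(0, 0.05)))

train_set, validation_set, test_set = split_dataset(records)
weights = fit_line(train_set)
print("fitted weights:", weights)
print("training MSE:", mse(weights, train_set))
```

The test set is deliberately left untouched here; it is reserved for the final evaluation after any hyperparameter tuning.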

The next step in modeling and development is to quantify the model performance on the validation set using the same metric as for the training process on the training set.
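Several of the metrics named above are simple to compute directly. The following sketch defines accuracy, F-score, mean absolute error, and root mean squared error in plain Python; the labels and predictions are made-up values standing in for a validation set.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical validation-set labels and model predictions (classification).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy:", accuracy(labels, predictions))
print("F-score:", f1_score(labels, predictions))

# Hypothetical validation-set targets and model estimates (regression).
targets = [2.0, 4.0, 6.0]
estimates = [2.5, 3.5, 7.0]
print("MAE:", mean_absolute_error(targets, estimates))
print("RMSE:", root_mean_squared_error(targets, estimates))
```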

A particularly powerful way to improve the model itself (as opposed to improving the data that go into the model or combining models) is called hyperparameter optimization (HPO). In HPO, a trained model is evaluated on the validation set, the hyperparameters are updated, the model is re-trained on the training set using this new set of hyperparameters, and the resulting model is re-evaluated on the validation set. This process is repeated until the loss calculated on the validation set is minimized with respect to the hyperparameters. Then, a final evaluation of the model performance using the optimized set of hyperparameters is made on the test set to serve as a prediction of how the model will perform in the “real world.”
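The HPO loop can be sketched with a simple grid search, which is one of several possible search strategies and is used here only because it is short. The model is a one-weight ridge regression whose single hyperparameter is the L2 penalty; the synthetic data, the candidate values, and the function names are assumptions for illustration.

```python
import random

def fit_ridge(data, penalty):
    """Closed-form fit of y = w*x with an L2 penalty on w (the hyperparameter)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + penalty)

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def make_points(n):
    """Synthetic data (assumption): y = 2x plus Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        points.append((x, 2.0 * x + rng.gauss(0, 0.5)))
    return points

train_set = make_points(30)
validation_set = make_points(30)
test_set = make_points(30)

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best_penalty, best_loss = None, float("inf")
for penalty in candidates:
    w = fit_ridge(train_set, penalty)   # re-train on the training set
    loss = mse(w, validation_set)       # evaluate on the validation set
    if loss < best_loss:
        best_penalty, best_loss = penalty, loss

# Final, one-time evaluation on the test set with the selected hyperparameter.
final_w = fit_ridge(train_set, best_penalty)
test_loss = mse(final_w, test_set)
print("selected penalty:", best_penalty)
print("validation MSE:", best_loss)
print("test MSE:", test_loss)
```

Because the test set plays no part in choosing the penalty, the test loss remains an unbiased estimate of performance on new data.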

Other ways to improve model performance include data augmentation, feature engineering, testing multiple models, and stacking models.

In the end, you will have developed a complete set of processing steps—i.e., an algorithm or analysis workflow that includes a model—that you would execute to perform a similar evaluation on a new dataset.

4. Evaluation

Evaluate the models against standard datasets and perform quality control/assurance for analysis workflows

Once the model has been trained and tested on the particular dataset of interest, it should be evaluated on standard datasets, if available. This will afford a low-risk opportunity to determine how the model may perform in the “real world,” without first going through the effort required in formal model deployment.

If such standardized datasets are not available for the application of interest, an alternative is to apply data augmentation techniques so that the input data differ from the dataset used for training, validation, and testing.

The higher-level goal of this step is to understand how the model performs on data that are slightly different from the original dataset of interest. If small perturbations in the data result in large (and incorrect) variations in model performance, then the model is not sufficiently robust. Corrective measures in the overarching algorithm must be implemented to ensure that the results of the model remain of high quality.
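One simple way to probe robustness is to add small random noise to the inputs and measure how often the predictions change. In the sketch below, the fixed linear classifier is a hypothetical stand-in for a trained model, and Gaussian noise is just one assumed form of perturbation; prediction agreement is used as a proxy for stability of model performance.

```python
import random

def predict(weights, x):
    """Hypothetical stand-in for a trained model: a fixed linear classifier."""
    w, b = weights
    return 1 if w * x + b > 0 else 0

def perturb(inputs, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return [x + rng.gauss(0, scale) for x in inputs]

def agreement(weights, inputs, scale, seed=0):
    """Fraction of predictions that stay the same after perturbation."""
    rng = random.Random(seed)
    noisy = perturb(inputs, scale, rng)
    same = sum(predict(weights, a) == predict(weights, b)
               for a, b in zip(inputs, noisy))
    return same / len(inputs)

weights = (1.0, -0.5)  # decision boundary at x = 0.5
inputs = [i / 100 for i in range(100)]
for scale in (0.0, 0.01, 0.1):
    print("noise scale", scale, "agreement", agreement(weights, inputs, scale))
```

A sharp drop in agreement at small noise scales would suggest that the model is not sufficiently robust for inputs near its decision boundary.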

This often requires moving backward to the previous step in the data science workflow (modeling and development), in which the overall analysis workflow is adjusted and re-analyzed. This is an example of the data science workflow not necessarily being a sequential set of steps but rather a fluid process involving feedback loops.

5. Deployment

Make the model or workflow available for use in the “real world”

Once the model is finalized, it is deployed to production or shared with the community. Model and workflow deployment can take many forms, including embedding models in applications or devices. Depending on the purpose of the solution, the process typically involves application developers, web developers, IT administrators, and/or cloud engineers. Deployment may be followed by further iterative processes for improvement and re-deployment.
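As one small illustration of sharing a finalized model, the sketch below writes the weights of a linear model to a JSON file and reloads them as a callable that an application could use. The file layout, the "model_type" field, and the function names are assumptions made for this example; real deployments vary widely.

```python
import json
import os
import tempfile

def save_model(weights, path):
    """Write the model weights to a JSON file (assumed, illustrative format)."""
    spec = {"model_type": "linear", "w": weights[0], "b": weights[1]}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(spec, handle)

def load_model(path):
    """Read the weights back and return a prediction function."""
    with open(path, encoding="utf-8") as handle:
        spec = json.load(handle)
    if spec["model_type"] != "linear":
        raise ValueError("unsupported model type")
    w, b = spec["w"], spec["b"]
    return lambda x: w * x + b

trained_weights = (3.0, 1.0)  # hypothetical result of the training step
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.json")
    save_model(trained_weights, path)
    served_model = load_model(path)

print(served_model(2.0))  # 3.0 * 2.0 + 1.0
```

Storing weights in a plain, documented format keeps the model independent of the code that trained it, which can simplify re-deployment after later improvements.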

6. Results and interpretation

Broadly interpret the results in the context of the application

Once the deployed model reaches the desired level of accuracy, the final step is to interpret the results and present them to stakeholders. In this report, whether written or oral, it is important to know the audience (particularly their knowledge of data science) and to present the findings in plain language.

The following elements are typically included in such reports:

Context (problem domain or environment)
Problem (research question)
Solution (answer to the research question)
Findings (results of the model deployment)
Limitations
Conclusion

Resources

Data science frameworks

Datasets

Open data sets from freeCodeCamp
List of datasets for machine-learning research from Wikipedia
21 Places to Find Free Datasets for Data Science Projects from Dataquest.io
Built-in datasets from scikit-learn
Model and Data Clearinghouse, clinical oncology
Imaging Data Commons from the National Cancer Institute
Genomic Data Commons from the National Cancer Institute
Cancer Imaging Archive

Machine learning APIs

scikit-learn (Python)
caret (R)
Machine learning courses:
- Machine Learning from Coursera
- CS 229: Machine Learning course materials from Stanford University
Textbooks:
