Data science pursues a rigorous and systematic process based on the scientific research method, promoting a circular approach that includes:
- Understanding of the problem
- Data exploration
- Data preparation
- Algorithm selection and model building
- Model performance evaluation
- Model deployment
Standards, processing techniques, and standard operating procedures are at the core of data science, and applying them consistently is critical. Data science sits at the intersection of laboratory research and data technologies, enhancing knowledge discovery and advancing in silico studies.
Our scientists in the Cancer Data Science Initiatives and Bioinformatics and Computational Science program have curated these standards and make them available to researchers to advance reproducibility in data science.
The data science workflow
The data science life cycle can be depicted via the Cross Industry Standard Process for Data Mining (CRISP-DM) and the OSEMN (Obtain data, Scrub data, Explore, Model, iNterpret results) framework. The workflow we elucidate in the following sections loosely aligns with these frameworks and can be summarized by the following steps:
- Problem definition and understanding
- Data preparation
- Modeling and development
- Model performance evaluation
- Model deployment
- Reporting and interpretation
1. Problem definition and understanding
Build an understanding of the question, requirements, and goals of the problem or challenge at hand
A good understanding of the research problem and its context enables selection of an appropriate approach to solving the problem in any discipline, including data science. Start by setting goals and expectations: What do you hope to achieve? What challenges (data-related, computational, infrastructure, resources, etc.) might you encounter? What are the expected benefits?
Next, follow these steps to define the problem:
- Frame the problem statement.
- Translate the statement into a data science problem, defining it from a perspective that includes data analysis, metrics, and patterns. For example, many data science problems can be categorized into supervised or unsupervised learning.
- Gather prior knowledge related to the problem and identify the data required and available.
2. Data preparation
Aggregate, assess, prepare, clean, and finalize the dataset involved
A comprehensive understanding of the dataset to be studied is crucial to developing effective models from the data. It also helps confirm that what you want to accomplish is possible.
A common misconception is that the next step in the data science workflow, modeling and development, is the most time-consuming and important. In reality, data preparation often takes the bulk of the time, and when performed well, it typically makes modeling much easier and more efficient.
Common steps taken at this stage include:
- Processing the dataset to address issues such as missing values, corrupt records, data inconsistency, incorrect data types, privacy (anonymization), and formatting.
- Exploring the data to develop an understanding of the information included in the data. If possible, identify any patterns, correlations, and characteristics/attributes of the data. Identify which attributes are more important to solving the problem.
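The processing and exploration steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed procedure: the records, the field names (`sample_id`, `age`, `tumor_size_mm`), and the choice of median imputation are all assumptions made for the example.

```python
import math
from statistics import median

# Hypothetical raw records; empty strings and free text stand in for
# missing and corrupt values.
raw_records = [
    {"sample_id": "S1", "age": "54", "tumor_size_mm": "12.5"},
    {"sample_id": "S2", "age": "", "tumor_size_mm": "8.0"},
    {"sample_id": "S3", "age": "61", "tumor_size_mm": "not recorded"},
    {"sample_id": "S1", "age": "54", "tumor_size_mm": "12.5"},  # duplicate
    {"sample_id": "S4", "age": "47", "tumor_size_mm": "15.0"},
]

def to_float(value):
    """Convert a raw string to float; return None if missing or corrupt."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clean(records, numeric_fields):
    # 1. Drop exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # 2. Fix data types: numeric fields become floats (or None).
    typed = [
        {k: (to_float(v) if k in numeric_fields else v) for k, v in rec.items()}
        for rec in unique
    ]
    # 3. Impute missing numeric values with the median of observed values
    #    (one possible strategy among many).
    for field in numeric_fields:
        observed = [rec[field] for rec in typed if rec[field] is not None]
        fill = median(observed)
        for rec in typed:
            if rec[field] is None:
                rec[field] = fill
    return typed

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cleaned = clean(raw_records, numeric_fields=("age", "tumor_size_mm"))
r = pearson([rec["age"] for rec in cleaned],
            [rec["tumor_size_mm"] for rec in cleaned])
```

In practice, the imputation strategy and the handling of corrupt records should be chosen with the problem domain in mind. The median is used here only because it is simple and robust to outliers.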
3. Modeling and development
Build models, algorithms, and analysis workflows using the relevant dataset
The next step in the data science life cycle is to develop an algorithm or analysis workflow, or to build a model that is learned from the data. The choice of model is generally heavily informed by the preceding data preparation step.
The purpose of modeling is generally to develop a parameterized mapping between the data and the response set, in the form of a function or process that has learned the characteristics of the data. While “modeling” in data science usually refers to machine learning models, it can also include experimental, probabilistic, graph theory, and differential equation models.
The following steps are typically involved at this stage:
- Model selection: There are often particular models associated with specific types of problems, e.g., image processing and text-related tasks.
- Data splitting: The models and their associated algorithms are often trained on a subset of the data called the training set, and their performance is measured on another subset called the test set. In addition, a third subset, the validation set, is often held out to determine the best hyperparameters, i.e., the settings that define the model itself rather than being fit during training.
- Model training: In this step, the weights associated with the model are fit to the training set; specifically, a scalar loss function is numerically minimized with respect to the weights. Some of the quantities below (e.g., logarithmic loss, mean squared error) can serve as loss functions, while others (e.g., accuracy, confusion matrix) are used to evaluate the trained model. Typical choices include:
  - For classification tasks: accuracy, confusion matrix, logarithmic loss, area under the curve, F-score
  - For regression tasks: mean absolute error, (root) mean squared error
  - Other metrics: chi-square, confidence interval, predictive power
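The data splitting and model training steps above can be sketched as follows. This is an illustrative example under stated assumptions: the data are synthetic and noise-free, the model is a one-feature linear function, the loss is mean squared error, and plain gradient descent stands in for whatever optimizer a real project would use. The split fractions, learning rate, and step count are arbitrary.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def mse(w, b, pairs):
    """Scalar loss: mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs)

def fit_linear(pairs, learning_rate=0.05, n_steps=2000):
    """Minimize the MSE with respect to the weights (w, b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(n_steps):
        # Gradients of the MSE with respect to each weight.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic data on the line y = 2x + 1 (an assumption for illustration).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(41)]
train, val, test = train_val_test_split(data)

w, b = fit_linear(train)      # weights are fit on the training set only
val_loss = mse(w, b, val)     # the validation set is not used for fitting
```

The test set is deliberately left untouched here; it is reserved for the final evaluation after all modeling choices have been made.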
The next step in modeling and development is to quantify model performance on the validation set, using the same metric that was used for the training process on the training set.
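Several of the metrics listed above are simple to compute by hand, which can help clarify what each one measures. The sketch below assumes binary labels coded as 0 and 1 and a 0.5 decision threshold; the label and probability values are made up for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 if there are no positives."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical validation labels and predicted probabilities.
val_labels = [1, 0, 1, 1, 0, 0, 1, 0]
val_probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
val_preds = [1 if p >= 0.5 else 0 for p in val_probs]

val_accuracy = accuracy(val_labels, val_preds)
val_f1 = f1_score(val_labels, val_preds)
val_log_loss = log_loss(val_labels, val_probs)
```

Note that accuracy and the F-score depend on the chosen threshold, whereas logarithmic loss is computed directly from the predicted probabilities.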
A particularly powerful way to improve the model itself (as opposed to improving the data that go into the model or combining models) is hyperparameter optimization (HPO). In HPO, a trained model is evaluated on the validation set, the hyperparameters are updated, the model is re-trained on the training set using this new set of hyperparameters, and the resulting model is re-evaluated on the validation set. This process is repeated until the loss calculated on the validation set is minimized with respect to the hyperparameters. Then, a final evaluation of the model performance using the optimized set of hyperparameters is made on the test set to serve as a prediction of how the model will perform in the “real world.”
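The HPO loop described above can be sketched with a simple grid search. Everything specific in this example is an assumption made for illustration: the synthetic data, the one-weight ridge regression model, the regularization strength `lam` as the single hyperparameter, and the candidate grid. Real projects often use more sophisticated search strategies than an exhaustive grid.

```python
import random

def fit_ridge(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (single weight, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train, val, candidates):
    """Re-train for each hyperparameter value and keep the one with the
    lowest validation loss."""
    best_lam, best_val_loss = None, float("inf")
    for lam in candidates:
        w = fit_ridge(*train, lam)      # re-train on the training set
        val_loss = mse(w, *val)         # evaluate on the validation set
        if val_loss < best_val_loss:
            best_lam, best_val_loss = lam, val_loss
    return best_lam, best_val_loss

# Synthetic noisy data around y = 3x (an assumption for illustration).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]
train = (xs[:40], ys[:40])
val = (xs[40:50], ys[40:50])
test = (xs[50:], ys[50:])

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, best_val_loss = grid_search(train, val, candidates)

# Final, one-time evaluation on the held-out test set.
final_w = fit_ridge(*train, best_lam)
test_loss = mse(final_w, *test)
```

The test set is touched only once, after the hyperparameter has been chosen, so that `test_loss` remains an honest estimate of performance on unseen data.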
Other ways to improve model performance include data augmentation, feature engineering, testing multiple models, and stacking models.
In the end, you will have developed a complete set of processing steps—i.e., an algorithm or analysis workflow that includes a model—that you would execute to perform a similar evaluation on a new dataset.
4. Model performance evaluation
Evaluate the models against standard datasets and perform quality control/assurance for analysis workflows
Once the model has been trained and tested on the particular dataset of interest, it should be evaluated on standard datasets, if available. This offers a low-risk opportunity to gauge how the model may perform in the “real world” without first going through the effort required for formal model deployment.
If such standardized datasets are not available for the application of interest, an alternative is to apply data augmentation techniques so that the input data differ from the dataset used for training, validation, and testing.
The higher-level goal of this step is to understand how the model performs on data that are slightly different from the original dataset of interest. If small perturbations in the data result in large (and incorrect) variations in model performance, then the model is not sufficiently robust. Corrective measures in the overarching algorithm must be implemented to ensure that the results of the model remain of high quality.
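One way to probe this kind of robustness is to add small random perturbations to the inputs and compare performance before and after. The sketch below is a toy illustration: the single-feature threshold classifier, the data, and the Gaussian noise model are all hypothetical stand-ins for a real trained model and a domain-appropriate perturbation.

```python
import random

def threshold_model(x):
    """Hypothetical trained classifier: predicts 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

def accuracy(model, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if model(x) == y) / len(xs)

def perturbation_check(model, xs, ys, noise_scale, n_trials=20, seed=0):
    """Return (baseline accuracy, worst accuracy over noisy copies of the inputs)."""
    rng = random.Random(seed)
    baseline = accuracy(model, xs, ys)
    worst = baseline
    for _ in range(n_trials):
        noisy = [x + rng.gauss(0, noise_scale) for x in xs]
        worst = min(worst, accuracy(model, noisy, ys))
    return baseline, worst

# Hypothetical evaluation data.
xs = [0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

baseline, worst_small = perturbation_check(threshold_model, xs, ys, noise_scale=0.01)
_, worst_large = perturbation_check(threshold_model, xs, ys, noise_scale=0.2)
```

A large gap between `baseline` and the worst-case accuracy at a small `noise_scale` would suggest that the model is sensitive to perturbations that should not change the answer.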
This often requires moving backward to the previous step in the data science workflow (modeling and development), in which the overall analysis workflow is adjusted and re-analyzed. This is an example of the data science workflow not necessarily being a sequential set of steps but rather a fluid process involving feedback loops.
5. Model deployment
Make the model or workflow available for use in the “real world”
Once the model is finalized, it is deployed to production or shared with the community. Model and workflow deployment can take many forms, including embedding models in applications or devices. Depending on the purpose of the solution, the process typically involves application developers, web developers, IT administrators, and/or cloud engineers. Deployment may be followed by further iterative processes for improvement and re-deployment.
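As one very simple illustration of packaging a model for use elsewhere, the fitted parameters can be written to a file and reloaded by the consuming application. The sketch below assumes a linear model with two parameters and a JSON file format; the file name, the `format_version` field, and the helper functions are hypothetical, and production deployments typically involve considerably more (versioning, monitoring, access control).

```python
import json
import os
import tempfile

def save_model(params, path):
    """Write model parameters to a JSON file with a format version tag."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"format_version": 1, "params": params}, f)

def load_predictor(path):
    """Load saved parameters and return a function that makes predictions."""
    with open(path, encoding="utf-8") as f:
        artifact = json.load(f)
    w = artifact["params"]["w"]
    b = artifact["params"]["b"]
    return lambda x: w * x + b

# Round trip: save hypothetical fitted parameters, then reload and predict.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "linear_model.json")
    save_model({"w": 2.0, "b": 1.0}, model_path)
    predict = load_predictor(model_path)
    prediction = predict(3.0)
```

Storing plain parameters in a text format keeps the artifact readable and independent of the training code, at the cost of having to re-implement the prediction function wherever the model is used.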
6. Results and interpretation
Broadly interpret the results in the context of the application
Once the deployed model reaches the desired level of accuracy, the final step is to interpret the results and present them to stakeholders. In this report, whether written or oral, it is important to know the audience (particularly their knowledge of data science) and to present the findings in plain language.
The following elements are typically included in such reports:
- Context (problem domain or environment)
- Problem (research question)
- Solution (answer to the research question)
- Findings (results of the model deployment)
AI and data science vignettes
Here is a case study that explores our use of this workflow and process:
- Nuclei segmentation: A vignette on an implementation of a deep learning pipeline for performing nucleus segmentation on high-content imaging data.