The importance of creating a standard data science pipeline for evaluating machine learning models is highlighted for one frequent image processing task: nucleus instance segmentation in high-content imaging. The vignette elaborates on every step in the data science workflow and shares concrete results from a detailed, published paper that evaluates three approaches to that task.
Key points
- An annotated dataset of segmented nuclei with different characteristics has been created using semi-automated methods.
- An off-the-shelf model for image segmentation is compared to locally trained models.
- In two out of four sub-datasets (MCF10A and HCT116), the off-the-shelf model performs as well as locally trained models at the object level.
- The annotated dataset can be used as a standard benchmark in every lab to continuously report on performance of new models and algorithms.
1. Problem definition and understanding
High‐content imaging uses automated liquid handling, image acquisition, and image analysis to screen the biological effect of hundreds of thousands of perturbing agents, such as RNAi, CRISPR/Cas9, and chemical compounds. This is done by measuring cellular phenotypic changes in microscopy‐based assays of interest. Because most cell types possess only one centrally positioned nucleus, and the nucleus is used to identify individual cells, precise and automated segmentation of nuclei stained with specific fluorescent dyes is the first essential analysis step in many high-content imaging workflows.
Cell lines and primary cells can have dramatically different nuclear sizes, and nuclear shape can also differ substantially. These differences include lobulation in polymorphonuclear cells of the immune system and cell growth leading to adjacent or overlapping nuclei. This makes it challenging to develop, test, and implement robust algorithms that provide accurate and automated nuclear segmentation results for a wide range of cell types and experimental conditions, even on the same image acquisition platform.
Deep learning is rapidly becoming the technique of choice for automated segmentation of nuclei in biological image analysis workflows. To evaluate the feasibility of training nuclear segmentation models on small, custom-annotated image datasets, researchers designed a computational pipeline to systematically compare different nuclear segmentation model architectures and model training strategies. This approach demonstrated that transfer learning, combined with tuning of training parameters such as the composition, size, and preprocessing of the training image dataset, can lead to robust nuclear segmentation models that match and often exceed the performance of existing, off‐the‐shelf deep learning models pretrained on large image datasets.
Collecting and distributing models pretrained on large, annotated biological image datasets is useful because it can, in principle, avoid the need for model training. Less is known about how pretrained models perform on images of primary cells or of cell lines that possess nuclear shapes not previously “seen” by the models, or on images acquired on different microscopes. In these cases, it would be useful to test which steps, if any, are most effective for obtaining precise nuclear segmentation by training convolutional neural network models on relatively small sets of images, which are readily available in most laboratories.
To address these questions, researchers designed and implemented an end‐to‐end computational pipeline to quickly train and evaluate the performance of machine learning‐based nuclear segmentation algorithms.
2. Data preparation
To mimic practical scenarios in the laboratory, researchers identified and generated segmentation labels in four cell types: HCT116, U2OS, MCF10A, and primary human eosinophils. These cell types have very different nuclear morphology and were imaged at different magnifications on the same high‐throughput imaging microscope.
The preliminary nuclear labels consist of about 4,000 nuclei from four different cell types in a total of 10 images. The labels were then manually corrected in an interactive fashion and used to train and evaluate the performance of different convolutional neural network‐based architectures for nuclear segmentation.
The instance segmentation labels of nuclei from fluorescence microscopy images used both in training and testing of the segmentation models were generated semi‐automatically in two steps (Fig. 1A).
First, preliminary labels were automatically generated using either classical image processing techniques (for example, seeded watershed) or existing, publicly available deep learning models for nuclear segmentation. Then, an expert cell biologist corrected nuclear segmentation mistakes using a combination of bitmask brushes and polygons to generate a set of high‐quality ground truth nuclear labels.
Images and labels for the cell types defined above were divided into two sets: training and testing. For each cell type, researchers used a dedicated training dataset: MCF10A_Original, U2OS_Original, HCT116_Original, and Eosinophils_Original. When the models were trained on training datasets from all cell types, researchers used a total of 4,012 nuclei in 10 full fields of view. After random sampling, researchers assigned 80 percent of the 35,000 augmented regions of interest to model training and the remaining 20 percent to model validation. The testing datasets were used only for assessing the performance of the deep learning models at inference. The testing datasets from all cell types comprised a total of 3,607 nuclei in 10 fields of view.
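The 80/20 random split of augmented regions of interest can be illustrated with a minimal sketch. It is not the published code: `split_rois`, the integer ROI identifiers, and the fixed seed are assumptions made for illustration.

```python
import random

def split_rois(roi_ids, train_fraction=0.8, seed=0):
    """Shuffle ROI identifiers and split them into training and validation sets."""
    rng = random.Random(seed)            # fixed seed (assumed) keeps the split reproducible
    shuffled = list(roi_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 35,000 augmented regions of interest, identified here by hypothetical integer IDs.
train_ids, val_ids = split_rois(range(35_000))
print(len(train_ids), len(val_ids))      # 28000 7000
```

Splitting identifiers rather than image arrays keeps the sketch independent of how the regions of interest are stored.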
3. Modeling and development
The pipeline was used to explore different training strategies for two preexisting convolutional neural network‐based image segmentation model architectures, the Feature Pyramid Network‐2‐watershed and the Mask R‐CNN (Fig. 1B).
Mask R‐CNN
- Researchers used the Matterport implementation of Mask R‐CNN and set the size of the input layer for both model training and inference to 256 by 256 pixels.
- At inference time, they first divided full fields of view into overlapping grids of 256 by 256 pixels and set the overlap value to 50 pixels, which on average was enough to cover one nucleus.
- They ran model inference on the regions of interest for a single field of view to generate nuclear labels, each with a unique identification.
- Nuclear segmentation labels with different identifications that overlapped above a certain threshold across two or more regions of interest were merged into a nuclear label with a single identification.
- Nuclear labels with an area below a certain threshold were eliminated.
- Since Mask R‐CNN can sometimes predict more than one connected component per nuclear label identification, researchers retained only the largest connected component for every nuclear label identification.
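The tiling step at inference can be sketched as follows. This is a minimal illustration under stated assumptions, not the published code: `tile_origins` and `tile_grid` are hypothetical helpers, the 1024 by 1024 field of view is an assumed size, and clamping the last tile to the image edge is only one possible way to handle borders.

```python
def tile_origins(length, tile=256, overlap=50):
    """Start offsets along one axis so that neighboring tiles share at least `overlap` pixels."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)         # clamp the final tile to the image edge
    return starts

def tile_grid(height, width, tile=256, overlap=50):
    """Top-left (row, column) corner of every tile covering a field of view."""
    return [(y, x)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

# Assumed field-of-view size, chosen only for illustration.
print(tile_origins(1024))                # [0, 206, 412, 618, 768]
print(len(tile_grid(1024, 1024)))        # 25
```

Because the last tile is clamped, its overlap with the previous tile can exceed 50 pixels; the merging step described above would then reconcile labels that appear in both tiles.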
Feature Pyramid Network‐2‐watershed
- Researchers trained two Feature Pyramid Network‐based models, each with one output layer: one predicts a normalized distance transform image for each nucleus, and the other predicts a Gaussian‐blurred (σ = 1) version of the nuclear label border.
- In the normalized distance transform, pixel values range between zero and one: pixels closer to the nuclear boundary have values closer to zero, and pixels at the center of the nucleus have values closer to one.
- Researchers then thresholded the predicted normalized distance transform at a value of 0.6.
- The thresholded regions served as seeds, and the blurred border images as inputs, for the seeded watershed segmentation algorithm that delineates nuclei (Fig. 1C).
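To make the normalized distance transform and its thresholding concrete, the sketch below computes the transform from a toy label image and derives seeds from it. It rests on several assumptions: in the pipeline the transform is predicted by a network rather than computed from labels, the thresholded map is assumed to supply the watershed seeds (as the seeded watershed step implies), and the function names are hypothetical. The watershed call itself is omitted.

```python
import numpy as np
from scipy import ndimage as ndi

def normalized_distance_transform(labels):
    """Per-nucleus Euclidean distance to the nucleus boundary, rescaled to peak at 1."""
    out = np.zeros(labels.shape, dtype=float)
    for label_id in np.unique(labels):
        if label_id == 0:                # 0 is background
            continue
        mask = labels == label_id
        dist = ndi.distance_transform_edt(mask)
        out[mask] = dist[mask] / dist.max()
    return out

def seeds_from_distance(ndt, threshold=0.6):
    """Threshold the normalized distance map and label each connected core as a seed."""
    return ndi.label(ndt >= threshold)

# Two touching rectangular "nuclei" that form a single foreground blob.
labels = np.zeros((8, 16), dtype=int)
labels[1:7, 1:8] = 1
labels[1:7, 8:15] = 2

ndt = normalized_distance_transform(labels)
seeds, n_seeds = seeds_from_distance(ndt)
print(ndi.label(labels > 0)[1], n_seeds)  # 1 2
```

The foreground mask is one connected component, but thresholding at 0.6 leaves two separate cores, which is what lets a seeded watershed split touching nuclei.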
4. Evaluation
Given the ground truth nuclear labels and the inference nuclear labels, researchers calculated F1 scores at different intersection over union thresholds. They used a two‐dimensional histogram to calculate the intersection area (the number of shared pixels) between all pairs of nuclei from the ground truth labels and the inference labels. They calculated the union area for every pair of ground truth and inference labels as the sum of the pixel counts of the two labels minus their intersection. For every pair of ground truth and inference labels, the intersection over union is the ratio of the intersection area to the union area. For any intersection over union threshold t between zero and one, the true positive statistic, TP(t), was the number of ground truth labels that have an intersection over union value equal to or higher than t with one inference label. Similarly, the false negative statistic, FN(t), was the number of ground truth labels that have an intersection over union value lower than t with all inference labels. Finally, the false positive statistic, FP(t), was the number of inference labels that have an intersection over union value lower than t with all ground truth labels.
Given TP(t), FN(t), and FP(t), the F1 score at a given intersection over union threshold t was calculated as:
$$F_1(t) = \frac{TP(t)}{TP(t) + \frac{FP(t) + FN(t)}{2}}$$
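The matching rules and the F1 score can be sketched with NumPy. This is an illustrative reading of the definitions above, not the published implementation: `iou_matrix` and `f1_at_threshold` are hypothetical names, and the label images are assumed to use 0 for background and consecutive integers 1..N for nuclei.

```python
import numpy as np

def iou_matrix(gt, pred):
    """Pairwise IoU between ground truth and inference labels (0 = background)."""
    n_gt, n_pred = int(gt.max()), int(pred.max())
    # 2D histogram: entry [i, j] counts pixels carrying ground truth
    # label i and inference label j.
    overlap, _, _ = np.histogram2d(
        gt.ravel(), pred.ravel(),
        bins=(np.arange(n_gt + 2), np.arange(n_pred + 2)),
    )
    area_gt = overlap.sum(axis=1, keepdims=True)     # pixels per ground truth label
    area_pred = overlap.sum(axis=0, keepdims=True)   # pixels per inference label
    union = area_gt + area_pred - overlap
    iou = np.divide(overlap, union, out=np.zeros_like(overlap), where=union > 0)
    return iou[1:, 1:]                               # drop the background row and column

def f1_at_threshold(iou, t):
    """F1(t) = TP / (TP + (FP + FN) / 2) from a ground-truth-by-inference IoU matrix."""
    best_per_gt = iou.max(axis=1, initial=0.0)
    best_per_pred = iou.max(axis=0, initial=0.0)
    tp = int(np.sum(best_per_gt >= t))
    fn = int(np.sum(best_per_gt < t))
    fp = int(np.sum(best_per_pred < t))
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0

gt = np.zeros((4, 10), dtype=int)
gt[:, 0:4] = 1
gt[:, 6:10] = 2
pred = np.zeros_like(gt)
pred[:, 0:4] = 1       # exact match, IoU = 1.0
pred[:, 7:10] = 2      # one column short, IoU = 12 / 16 = 0.75

iou = iou_matrix(gt, pred)
print(f1_at_threshold(iou, 0.7))   # 1.0
print(f1_at_threshold(iou, 0.9))   # 0.5
```

In the toy example the second nucleus counts as a true positive at t = 0.7 but as both a false negative and a false positive at t = 0.9, which halves the score.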
Furthermore, researchers defined an over‐splitting event as a type of error where a nucleus in ground truth is matched with more than one inference nucleus with an intersection over union of at least 0.1. They defined a merge error as a type of error where an inference label is matched to more than one ground truth label with an intersection over union of at least 0.1.
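Both error types can be counted directly from a matrix of pairwise intersection over union values. The sketch below assumes rows index ground truth nuclei and columns index inference nuclei; the function name and the example values are hypothetical.

```python
import numpy as np

def count_split_and_merge(iou, min_iou=0.1):
    """Count over-splitting and merge errors from a ground-truth-by-inference IoU matrix."""
    matches = np.asarray(iou) >= min_iou
    n_split = int(np.sum(matches.sum(axis=1) > 1))   # ground truth nucleus matched by >1 inference nuclei
    n_merge = int(np.sum(matches.sum(axis=0) > 1))   # inference label matched to >1 ground truth nuclei
    return n_split, n_merge

# Hypothetical IoU values: ground truth nucleus 0 is split across inference
# nuclei 0 and 1, and inference nucleus 2 merges ground truth nuclei 1 and 2.
example_iou = [
    [0.45, 0.40, 0.00],
    [0.00, 0.00, 0.30],
    [0.00, 0.00, 0.35],
]
print(count_split_and_merge(example_iou))   # (1, 1)
```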
The intersection over union used for calculating the F1 score can aggressively penalize the performance readout of segmentation models for datasets with predominantly small objects, such as circular objects with a radius of 5 pixels. The penalty is more severe at threshold values t > 0.6. For example, consider one dataset in which all objects are perfect circles with a radius of 60 pixels and another dataset in which all objects are perfect circles with a radius of 4 pixels. These two datasets, with some approximations, simulate images acquired on a microscope using 60X and 4X objectives, respectively. Also assume that the segmentation model always makes a one‐pixel error along the boundary on both datasets. For the dataset with objects with a radius of 60 pixels, the intersection over union will be 0.967. For the dataset with objects with a radius of 4 pixels, the intersection over union will be 0.563. Thus, at an intersection over union threshold value t = 0.6 (IoU[t = 0.6]), every object in the small‐object dataset falls below the threshold, and the segmentation model will appear to be severely underperforming even though the absolute boundary error is identical.
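The two values quoted above follow from a short area calculation. The sketch assumes the one‐pixel error shrinks the predicted disk (radius r − 1 inside a true disk of radius r), which is the reading that reproduces both numbers; a prediction that is one pixel too large would instead give 16/25 = 0.64 for the small objects. The function name is hypothetical.

```python
def iou_concentric_disks(radius, boundary_error=1.0):
    """IoU of a disk and a concentric disk whose radius is smaller by `boundary_error`."""
    inner = radius - boundary_error
    # Intersection is the inner disk and union is the outer disk, so the pi factors cancel.
    return (inner / radius) ** 2

for r in (60, 4):
    print(r, f"{iou_concentric_disks(r):.4f}")
# 60 0.9669
# 4 0.5625
```

The same one‐pixel error therefore costs about 3 percent of the intersection over union at a radius of 60 pixels and about 44 percent at a radius of 4 pixels.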
5. Deployment
The generated dataset and trained models have been made publicly available on figshare. The conda environment, the code to run the models in inference, and the post‐inference scripts are shared in GitHub repositories. The software and the models have also been made available on local, on‐premises compute infrastructure.
6. Results and interpretation
Researchers benchmarked the nuclear segmentation inference performance of Mask R‐CNN and Feature Pyramid Network‐2‐watershed against Jacobkie, a pretrained deep learning model for nuclear segmentation. Jacobkie has a model architecture similar to that of Feature Pyramid Network‐2‐watershed and was trained on a large and diverse dataset of 841 images of nuclei from different sources. Jacobkie was ranked second out of about 10,000 submissions in the 2018 Kaggle Data Science Bowl, and it has been suggested that this model can be used in inference by end users without retraining.
In experiments using Jacobkie on the test images used in this study, visual comparison of inferred nuclear segmentation masks and comparison of F1 scores indicated:
- While Jacobkie was generally less precise than Mask R‐CNN or Feature Pyramid Network‐2‐watershed at the pixel segmentation level, it achieved a comparable object segmentation performance for MCF10A and HCT116 cells.
- Jacobkie achieved an inferior performance on U2OS cells and primary eosinophils when compared to the custom trained models.
- Mask R‐CNN seemed to generalize better to irregular nuclear shapes, such as those of primary eosinophils (Fig. 2A).
- In line with the qualitative visual results, comparison of F1 segmentation scores for the three segmentation models indicated that Mask R‐CNN and Feature Pyramid Network‐2‐watershed achieved similar or better F1 inference scores than Jacobkie for all cell types at IoU[t = 0.7], which is indicative of instance segmentation precision.
- For MCF10A and U2OS, Mask R‐CNN and Feature Pyramid Network‐2‐watershed showed better F1 scores than Jacobkie at IoU[t = 0.9], which indicates segmentation precision at the pixel level.
- As for HCT116 and eosinophils, none of the convolutional neural network models showed F1 scores greater than 0.5 at IoU[t = 0.9], indicating a failure of these models to perform precise pixel‐level segmentation of nuclei that have a small area relative to the pixel size of the image (Fig. 2B, Table 1).
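
Table 1. F1 inference scores for each model and cell type at intersection over union thresholds t = 0.7 and t = 0.9. FPN2‐WS: Feature Pyramid Network‐2‐watershed; MRCNN: Mask R‐CNN.

| Model | MCF10A, t = 0.7 | MCF10A, t = 0.9 | HCT116, t = 0.7 | HCT116, t = 0.9 | U2OS, t = 0.7 | U2OS, t = 0.9 | Eosinophils, t = 0.7 | Eosinophils, t = 0.9 |
|---|---|---|---|---|---|---|---|---|
| FPN2‐WS | 0.98 | 0.96 | 0.87 | 0.14 | 0.97 | 0.94 | 0.46 | 0.21 |
| MRCNN | 0.97 | 0.95 | 0.90 | 0.40 | 0.97 | 0.95 | 0.83 | 0.39 |
| Jacobkie | 0.96 | 0.03 | 0.85 | 0.07 | 0.84 | 0.30 | 0.23 | 0.00 |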