A data set may not sound like a flashy scientific advance, but that impression belies its potential to enable eye-catching science.
Such is the case with CEM500K, a new data set created by Kedar Narayan, Ph.D., and Ryan Conrad in the Center for Molecular Microscopy at Frederick National Laboratory. As reported in eLife last week, it’s a cutting-edge resource that substantially improves a technique for electron microscopy analysis.
CEM500K is a curated resource for conditioning artificial intelligence (AI) algorithms to recognize relevant features in electron microscopy images, and it conditions the algorithms better than any other published data set.
A universal training set
This recognition process is called “segmentation,” a method for highlighting a feature from the rest of the image, such as picking out mitochondria from electron micrographs of a cell.
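In computational terms, the output of segmentation is a mask: one label per pixel, marking whether that pixel belongs to the feature. The toy sketch below is illustrative only. It uses a fixed brightness threshold, an assumption chosen for simplicity, where a real pipeline would use a trained model, and the function name and pixel values are hypothetical.

```python
def threshold_segment(image, threshold):
    """Return a binary mask: 1 where a pixel is darker than `threshold`, else 0.

    This is a stand-in for a trained model. Real electron micrographs
    need far more than a brightness cutoff.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# Hypothetical 4x4 grayscale patch (0 = black, 255 = white) with a dark
# 2x2 "feature" in the middle.
toy_image = [
    [200, 210, 205, 198],
    [202, 40, 35, 201],
    [199, 38, 42, 207],
    [204, 203, 200, 206],
]

mask = threshold_segment(toy_image, threshold=100)
for row in mask:
    print(row)
```

Doing this by hand means drawing that mask pixel by pixel across thousands of images, which is why manual segmentation is so slow.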
Microscopists often do segmentation manually, as Narayan and Conrad have done. It’s a painstaking process that can take weeks or months of on-and-off work, creating a large bottleneck in research. Technological advances in microscopy have made it even less practical.
“You have microscopes that can generate hundreds of thousands of images. It’s terabytes, tens of terabytes of data. It’s just not possible for anyone to go through and look at and annotate the data,” said Conrad, who is a research associate in the Center for Molecular Microscopy.
AI can do it much faster, and scientists have been trying for years to train models and algorithms using data sets. The algorithms view and analyze the images either independently or guided by a human. Over time, they become increasingly adept at segmenting relevant features the microscopists want to study.
But it’s still challenging. Most of these data sets are small, limited in scope, and relevant to a single project. That means the AI can segment well for the project at hand, but it usually struggles to segment images from a different project. For instance, an AI that learns to segment mitochondria in images of melanoma cells may falter when presented with images of kidney cells. Often, new projects require scientists to develop a new data set and AI.
Narayan and Conrad built CEM500K with these inefficiencies in mind. They curated 500,000 unique, diverse images from more than 100 studies to create a data set that applies to a broad spectrum of electron microscopy. CEM500K is meant to be a “universal” training set that can be used to refine an AI model or algorithm for any project.
“[It] is pure gold for training. Why? Because it’s relevant, it’s information rich, it’s heterogeneous and nonredundant, and it’s nimble,” said Narayan, senior scientist and volume-EM group lead in the Center for Molecular Microscopy.
Beating the benchmarks
To validate their creation, Conrad and Narayan developed an untrained neural network model and conditioned it on CEM500K, then used it to segment diverse electron microscopy images. For comparison, they conditioned copies of the same untrained model on six other publicly available, gold-standard benchmark data sets and had the copies segment the same diverse images.
The model trained on CEM500K far outperformed them all, demonstrating the data set’s usefulness for various electron microscopy projects.
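Comparisons like this depend on a score for how closely a model’s segmentation matches a human-annotated reference. The article does not name the metric used, so the sketch below assumes intersection-over-union (IoU), a common choice for segmentation. The masks and model names are made up for illustration.

```python
def iou(predicted, reference):
    """Intersection-over-union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            intersection += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x3 reference annotation and two model predictions.
reference = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
prediction_a = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]  # misses one feature pixel
prediction_b = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]  # marks two extra pixels

scores = {
    "model_a": iou(prediction_a, reference),
    "model_b": iou(prediction_b, reference),
}
best = max(scores, key=scores.get)
print(scores, best)
```

A score of 1.0 means the prediction and the reference overlap exactly. Both missed pixels and extra pixels pull the score down.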
“It recognized, without any human intervention, what was a feature of interest in an electron micrograph,” Narayan said. “It was able to beat all benchmarks, and not only did it beat all benchmarks, it uncovered errors that humans had made in those benchmarks.”
For additional validation, the pair trained other models on CEM500K, then tested their segmentation abilities. These models performed better after being trained on CEM500K than they did after training with other sets.
According to Conrad, this level of accuracy makes CEM500K a potent resource that may enable larger and more robust analyses in the future.
More power on the horizon
Conrad curated the images for CEM500K from internal and external projects, spending weeks scouring journals and databases across the internet. The collection initially amounted to more than five million micrographs. From there, he worked to wrestle it into something more informative.
“I actually ended up using another AI, another neural network, to help with this. So I just went through and saw the images that I thought looked ‘bad,’ some images that looked ‘good’ … and then taught the neural network to figure that out, figure out what I was able to see but couldn’t express in a computer program,” he said, adding that he standardized the images into one shared format for added consistency.
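The general shape of such a curation step, removing repeats and then keeping only images that a scoring function rates highly, can be sketched as follows. This is a minimal sketch under stated assumptions, not the published pipeline: exact-match hashing stands in for duplicate detection, `toy_quality_score` is a hypothetical placeholder for the trained neural network, and the file names and byte values are invented.

```python
import hashlib


def deduplicate(images):
    """Keep the first copy of each image, comparing exact byte content."""
    seen = set()
    unique = []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


def filter_by_quality(images, score_fn, min_score):
    """Keep images whose score from `score_fn` is at least `min_score`."""
    return [(name, data) for name, data in images if score_fn(data) >= min_score]


def toy_quality_score(data):
    """Hypothetical placeholder for a trained classifier.

    Scores an image by the fraction of distinct byte values it contains,
    so a blank image scores low.
    """
    return len(set(data)) / len(data) if data else 0.0


# Invented candidates: (file name, raw pixel bytes).
candidates = [
    ("a.tif", bytes([10, 50, 90, 130])),
    ("a_copy.tif", bytes([10, 50, 90, 130])),  # exact duplicate of a.tif
    ("blank.tif", bytes([0, 0, 0, 0])),        # uninformative image
    ("b.tif", bytes([5, 5, 80, 200])),
]

unique = deduplicate(candidates)
curated = filter_by_quality(unique, toy_quality_score, min_score=0.5)
kept_names = [name for name, _ in curated]
print(kept_names)
```

Exact hashing only catches byte-identical copies. Catching near-duplicates, such as adjacent slices of the same volume, would need a similarity measure instead.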
The result was the 500,000 images that now make up CEM500K, but there’s still room to grow. Both the data set and Conrad’s sorting pipeline are publicly available online. Narayan and Conrad hope to partner with other groups and invite them to contribute their data via the pipeline so CEM500K can become more useful over time.
“The great thing about CEM500K is that the resource is the data set, not the model, meaning that it can become more powerful as it expands and the models become more powerful,” Narayan said.
“We want to have enough data that these models will see more than any human has ever seen,” he said. “Anything that can be seen, hopefully, [an] AI system will [learn to] see. And then what that allows you to do is, as a researcher, you can look at these results and immediately understand what’s going on in your data.”
Image by Kedar Narayan and Joe Meyer, staff illustrator