Life scientists have a data problem: information is fragmented, siloed and incomplete. And that gets in the way of taking full advantage of artificial intelligence technology.
A panel of researchers discussed challenges to adopting AI tools in life sciences at the Intelligent Application Summit hosted by Madrona Venture Group in Seattle last week.
Artificial intelligence is transforming how tech companies do everything from selling products to routing packages. New AI “foundation” models like GPT-3 and DALL-E that can generate new sentences or images were built using massive training sets pulled from the internet.
But in the life sciences, “the standardization of the data is very challenging,” said panelist Maddison Masaeli, CEO of Deepcell, a startup that visually analyzes and categorizes single cells.
Cell biology information is plagued by differences in sample collection, storage and processing, said Masaeli, hindering comparisons across datasets. “From the point of sample collection until you have the image, there are tens of steps that cause variability in the data,” she said.
Not all life sciences data are messy. Protein structures, for instance, are represented in standardized ways in standardized databases. That enabled the training of DeepMind’s AlphaFold and the University of Washington’s RoseTTAFold, AI tools that recently cracked open the longstanding problem of predicting protein folding. More recently, the UW released ProteinMPNN, an AI-powered protein design tool.
But even for proteins, a lot of information is behind a wall. Lucas Nivon, CEO of Seattle protein design startup Cyrus Biotechnology, said that Cyrus approached big pharma companies about sharing their databases on the structure of antibodies, the basis of many treatments. Tens of thousands of such structures are siloed at various companies.
The companies were all interested in pooling data, and discussed mechanisms for sharing proprietary structures, said Nivon. “And then nobody wanted to be the first, the lead investor, so to say,” he said.
Cyrus joined with Amazon Web Services and other partners this summer to create an open-source protein design nonprofit, OpenFold, that is now talking with potential partners about sharing such antibody structure data.
“There is that dark matter that is just sitting there on the side. It’s literally just there,” said Nivon. “And everybody admits it.”
The issues of reliability and bias that plague AI modeling in tech applications also affect the life sciences, but in different ways, said the panelists.
When AI churns out a nonsense paragraph, users can see that right away. But if it’s spitting out the wrong diagnosis or the wrong protein structure, it’s harder to assess, said Jonathan Carlson, who leads life sciences research and incubation at Microsoft Health Futures, part of the tech giant’s research division.
“Many of the problems we see in life sciences are not unique, but they’re very acute,” added Carlson.
Testing products made through AI and then feeding the data back into the model sounds tidy in principle, but in the life sciences the process can take a long time. Cyrus is testing some of its engineered proteins with collaborators who are generating new transgenic mice, a process that can take well over a year. But Nivon’s team also leverages high throughput in vitro and cellular screening systems.
Efforts to optimize screening systems will enable faster honing of AI models, said Nivon. He pointed to Capsida Biotherapeutics, which iteratively engineers and screens designs for gene therapy using animal models, harvesting tissue to assess which designs are effectively getting to the right place in the body.
Researchers would like to better connect biological data to clinical outcomes, but there’s a lot standing in the way, including the need to protect privacy, said Masaeli. “There is no one power of Google that includes all the health data or biological data of the world,” she said.
Carlson envisions a future when more life sciences data are de-identified and funneled into standardized, interconnected formats. Ultimately, data from clinical trials and animal experiments could feed back efficiently into a network to help develop new hypotheses and hone questions for basic research.
How to get there is a major question for the field, said Carlson: “How do we enable collaboration while still respecting not only intellectual property but privacy? What does it actually mean to be able to build large foundation models when we can’t even get the data open?”