Introducing Gaia-01: Decoding nature’s molecules


Published October 27, 2025

Email contact@novogaia.bio

We are excited to introduce Gaia-01, a new foundation model for molecular structure prediction from mass spectrometry data that achieves a 13% performance improvement over the current state of the art. Gaia-01 marks an important step forward in addressing a key bottleneck in drug discovery: the identification of novel, structurally diverse small molecules as starting points.


We evaluated Gaia-01 on the de novo molecule generation task of MassSpecGym, the industry-standard benchmark for predicting molecular structures from mass spectra. Gaia-01 outperforms all other models, including MADGEN (Tufts, 2025), DiffMS (MIT, 2025), and MIST-MolForge (DSO, 2025). Top-10 accuracy measures how often the correct molecular structure appears among the model’s ten highest-ranked predictions.
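To make the top-10 metric concrete, here is a minimal Python sketch of top-k accuracy. It is an illustration only, not the MassSpecGym evaluation code: the function name, the toy data, and the assumption that structures are compared as already-canonicalized identifier strings are all ours.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of spectra whose true structure is among the first k predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked prediction list per target")
    if not targets:
        return 0.0
    # A spectrum counts as a hit if its true structure appears in the top k.
    hits = sum(
        target in preds[:k]
        for preds, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Hypothetical toy data: one ranked candidate list per spectrum.
predictions = [
    ["CCO", "CCN", "CCC"],   # true structure ranked first
    ["CCN", "CCC", "CCO"],   # true structure ranked third
    ["CCC", "CCN", "CCCl"],  # true structure absent
]
truth = ["CCO", "CCO", "CCO"]

top1 = top_k_accuracy(predictions, truth, k=1)  # 1 of 3 spectra is a hit
top3 = top_k_accuracy(predictions, truth, k=3)  # 2 of 3 spectra are hits
```

A real evaluation would also need a rule for deciding when two structures count as the same molecule; exact string equality is used here only to keep the sketch short.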

Expanding chemical space for drug discovery


Small-molecule therapeutics remain the cornerstone of modern medicine. Yet most new drugs are built from a limited set of known chemical scaffolds that are easy to synthesize. This has kept discovery efforts clustered in a narrow region of chemical space, repeatedly optimizing known, low-complexity structures rather than exploring novel ones.


Meanwhile, nature operates in a far broader chemical universe. Over billions of years, it has generated complex, functional molecules far beyond the diversity of what humans can make. About half of all approved small-molecule drugs are in fact structurally inspired by natural designs, with evolution optimizing these molecules for biological function. The visualization below shows how natural molecules (from animals, bacteria, fungi, and plants) occupy distinct regions of chemical space compared to synthetic ones. Each dot represents a molecule, clustered by similarity.

[Interactive visualization: zoom through chemical space]

While screening nature for new molecules has sharply declined since the 1980s due to slow, manual discovery workflows and limited structural identification tools, advances in machine learning and analytical chemistry now make it possible to decode nature’s chemistry at scale. This unlocks vast, bioactive regions of chemical space that were previously out of reach for drug discovery.


AI for molecular structure prediction from spectra


The key step for knowing what molecules are in a natural sample, and whether they are interesting for drug discovery, is decoding their chemical structure. Mass spectrometry is the fastest and most sensitive method for profiling molecules from natural samples, and its resolution is now orders of magnitude higher than during past large-scale screening efforts. The technology works by breaking molecules into fragments and measuring their mass-to-charge ratios, the patterns of which can be used to infer a molecule’s chemical structure. Recent advances now allow fine-grained distinction between closely related molecules and reconstruction of complex molecular structures directly from natural samples. Modern machine learning transforms this detection tool into a predictive tool for inferring molecular structures.
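As a small illustration of the mass-to-charge ratio mentioned above, the sketch below computes the m/z a mass spectrometer would record for a singly protonated molecule. The example molecule (caffeine, C8H10N4O2), the function names, and the rounded mass constants are our own choices for illustration; nothing here is part of Gaia-01.

```python
# Approximate monoisotopic masses in daltons (rounded constants).
MONOISOTOPIC_MASS = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}
PROTON_MASS = 1.007276  # approximate mass of a proton in daltons


def monoisotopic_mass(composition: dict[str, int]) -> float:
    """Sum the monoisotopic masses of all atoms in an elemental composition."""
    return sum(MONOISOTOPIC_MASS[element] * count for element, count in composition.items())


def protonated_mz(neutral_mass: float, charge: int = 1) -> float:
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# Illustrative example: caffeine, C8H10N4O2.
caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
mass = monoisotopic_mass(caffeine)  # about 194.0804 Da
mz = protonated_mz(mass)            # about 195.0877 for the [M+H]+ ion
```

A fragmentation spectrum is a list of such m/z values, one per fragment ion, each paired with an intensity. That list is the kind of input a structure-prediction model has to work from.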

Gaia-01

At Novogaia, we’ve built Gaia-01, an autoregressive transformer model that predicts molecular structures directly from mass spectrometry data. The 1-billion-parameter model was trained using self-supervised pre-training on one hundred million small molecules, followed by fine-tuning on two hundred thousand labeled spectrum–structure pairs. This large-scale training primes Gaia-01 to generalize beyond known molecules and accurately infer complex, novel structures.


Gaia-01 was built on the MIST encoder checkpoint (the fingerprint predictor) from DiffMS, similar to the approach used in MIST-MolForge. Because these checkpoints are self-reported and cannot be flexibly extended, we are now developing Gaia-02 with a fully independent encoder. Gaia-02 currently achieves 8.6% top-10 accuracy with more robust performance across diverse chemical space. Full results will be detailed in an upcoming preprint.


The MassSpecGym benchmark measures how precisely a model can rebuild a molecule from its spectrum, but in practical drug discovery settings, what truly matters is being able to predict chemical properties that define a compound’s potential. Using Gaia-01, we can infer these properties directly from predicted molecular structure, making it a powerful tool for targeted screening and molecule prioritization.


Below, we show R² scores for ten drug-likeness properties calculated from Gaia-01’s predicted structures. R² measures how closely the predicted values match the real data: the closer it is to 1, the more of the true variation in the measured property the model explains. Although public data for MIST-MolForge are not yet available, our comparison with DiffMS shows that Gaia-01 achieves substantially higher R² scores, delivering accuracy sufficient for direct experimental use in the lab.
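For readers who want the R² score spelled out, here is a minimal Python sketch of the standard coefficient of determination. The property values are invented for illustration (think of one property computed from the true structures and from the predicted structures); they are not Gaia-01 results, and the function name is ours.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when the true values do not vary")
    return 1.0 - ss_res / ss_tot


# Hypothetical values of one property for four molecules.
from_true_structures = [1.0, 2.0, 3.0, 4.0]
from_predicted_structures = [1.1, 1.9, 3.2, 3.8]

score = r_squared(from_true_structures, from_predicted_structures)  # about 0.98
```

A score of 1 means the values from predicted structures match the reference values exactly, and a score of 0 means they do no better than always predicting the mean. R² can also be negative when predictions are worse than that baseline.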


What Gaia-01 enables

Gaia-01 advances two critical capabilities:

  1. New chemical starting points for drug discovery, directly from nature

    Gaia-01 allows us to rapidly identify molecules with drug-like properties from natural samples. From these natural molecules, we can design synthetic analogues for testing against therapeutic targets, bridging nature’s molecular diversity with modern medicinal chemistry.

  2. A vastly expanded data foundation for generative molecule design

    Current generative small-molecule models repurpose known compounds due to limited data. Gaia-01 can recover molecular structures hidden in millions of publicly available mass spectral data points, expanding the set of known natural molecules by up to 100-fold. This opens the door to generative models that learn not just from human-made chemistry, but from nature’s own design principles.

Next steps

Gaia-01 was built through the dedication of a small team over a few intense months. We are computational biologists and machine learning engineers from leading academic labs at Imperial College London, UCL, TU Delft, and ETH Zurich. We are backed by David Pompliano, a former GSK and Merck executive and serial biotech entrepreneur, and Tomáš Pluskal, a pioneer in ML for metabolomics.


At Novogaia, we apply these technologies to decode fungal chemistry. Fungi remain one of nature’s richest but most unexploited sources of pharmacologically active molecules. Our mission is to unlock a new era in drug discovery from fungi by using AI to systematically uncover their molecular diversity and translate it into new therapeutic breakthroughs. To make that happen, we’re building a broader AI-driven discovery pipeline that brings this technology fully to life. We’ll share more about this soon.


Next, we’re excited to test Gaia-01 in the lab and demonstrate how it can accelerate real-world discovery on our first set of drug targets related to autoimmunity. We’re making the model available to select research partners to expand its applications. If you’re interested in trying Gaia-01 or collaborating on natural product discovery, we’d love to connect: contact@novogaia.bio.

© 2025 Novogaia Inc.