IBM-ESA TerraMind





TerraMind


built by IBM and ESA Φ-lab

🤗 Models · arXiv · Code · Challenge

trained at the Jülich Supercomputing Centre with funding via the FAST-EO project



🎉 TerraMind has been accepted at ICCV 2025!

Meet TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation. TerraMind achieves a new level of understanding of geospatial data, introduces new capabilities such as Thinking-in-Modalities (TiM), and significantly outperforms existing models across community-standard benchmarks.

💡 How does TerraMind work?


    TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances.

    Architecture
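
To make the dual-scale idea concrete, here is a minimal PyTorch sketch of a module that embeds one modality at both scales and hands a single sequence to a transformer. The dimensions, vocabulary size, and module layout are illustrative assumptions for this sketch, not TerraMind's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; not TerraMind's actual configuration.
PATCH = 16      # patch size used for both streams
D_MODEL = 768   # transformer embedding width
VOCAB = 16384   # hypothetical size of the discrete token codebook

class DualScaleEmbedder(nn.Module):
    """Embeds one modality at two scales: discrete tokens (high-level
    context) and raw pixel patches (fine-grained spatial detail)."""
    def __init__(self, in_channels: int):
        super().__init__()
        # Token level: look up an embedding for each discrete codeword.
        self.token_embed = nn.Embedding(VOCAB, D_MODEL)
        # Pixel level: linearly project each 16x16 patch, as in ViT.
        self.pixel_embed = nn.Conv2d(in_channels, D_MODEL,
                                     kernel_size=PATCH, stride=PATCH)

    def forward(self, tokens: torch.Tensor, image: torch.Tensor):
        # tokens: (B, N) codeword ids; image: (B, C, H, W)
        tok = self.token_embed(tokens)                             # (B, N, D)
        pix = self.pixel_embed(image).flatten(2).transpose(1, 2)   # (B, N', D)
        # Both streams share the same embedding width, so one
        # transformer encoder can attend across them jointly.
        return torch.cat([tok, pix], dim=1)

emb = DualScaleEmbedder(in_channels=12)  # e.g. 12 Sentinel-2 bands
seq = emb(torch.randint(0, VOCAB, (1, 196)), torch.randn(1, 12, 224, 224))
print(seq.shape)  # torch.Size([1, 392, 768])
```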

    Because it learns correlations across modalities, TerraMind can be applied to a range of downstream applications. TerraMind encodes inputs into a well-structured embedding space, making the encoder suitable for classical fine-tuning. Additionally, we introduce Thinking-in-Modalities (TiM) tuning, which first generates intermediate tokens of another modality before predicting the task output. Finally, the model can natively generate any pre-training modality from the others and supports chained generation for consistent outputs across modalities.

    Architecture
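
As a rough illustration of the classical fine-tuning path, the sketch below puts a simple segmentation head on top of per-patch embeddings like those the encoder produces. The head design and shapes are assumptions for illustration, and `load_terramind_encoder` is a placeholder for however you obtain the pretrained encoder (e.g. via the Hugging Face models linked above).

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Projects per-patch embeddings to class logits and upsamples
    them back to a dense per-pixel mask."""
    def __init__(self, d_model: int, num_classes: int, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(d_model, num_classes, kernel_size=1)
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear",
                              align_corners=False)

    def forward(self, feats: torch.Tensor, hw: tuple[int, int]):
        # feats: (B, N, D) patch embeddings from the encoder
        b, n, d = feats.shape
        h, w = hw
        x = feats.transpose(1, 2).reshape(b, d, h, w)  # (B, D, H/16, W/16)
        return self.up(self.proj(x))                   # (B, classes, H, W)

# encoder = load_terramind_encoder(...)  # placeholder, not a real API
head = SegmentationHead(d_model=768, num_classes=10)
feats = torch.randn(2, 196, 768)        # stand-in for encoder output
mask_logits = head(feats, hw=(14, 14))  # (2, 10, 224, 224)
```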

    TerraMind leverages autoencoder-based architectures with a quantization step in the bottleneck for image-like modalities such as Sentinel-1, Sentinel-2, LULC, NDVI, and DEM. Tokenizer encoders process an input image and generate a latent representation for each 16×16 patch, which is then discretized with finite scalar quantization (FSQ) into one of N codewords.

    Tokenizer Architecture
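
The following is a simplified FSQ sketch in PyTorch: each latent dimension is bounded, snapped to a small integer grid with a straight-through estimator, and the implicit codebook size N is the product of the per-dimension level counts. The level configuration is an illustrative assumption, not TerraMind's actual setting, and the half-level shift the FSQ paper applies for even level counts is omitted for brevity.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Simplified finite scalar quantization (FSQ): each latent dimension
    is bounded and rounded to a small integer grid; the implicit codebook
    size N is the product of the per-dimension level counts."""
    def __init__(self, levels=(7, 7, 7, 5, 5, 5)):
        super().__init__()
        half = (torch.tensor(levels, dtype=torch.float) - 1) / 2
        self.register_buffer("half", half)
        self.codebook_size = int(torch.prod(torch.tensor(levels)))  # N

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., d) continuous latent, one scalar per quantized dimension
        bounded = torch.tanh(z) * self.half     # squash into (-half, half)
        quantized = torch.round(bounded)        # snap to the integer grid
        # Straight-through estimator: quantize in the forward pass,
        # pass gradients through unchanged in the backward pass.
        return bounded + (quantized - bounded).detach()

fsq = FSQ()
print(fsq.codebook_size)          # 42875 codewords with these levels
latents = torch.randn(1, 196, 6)  # one latent per 16x16 patch
quantized = fsq(latents)          # same shape, discretized values
```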

🚀 How does TerraMind compare to other models?

TerraMind was benchmarked by ESA in both unimodal and multimodal settings following the community-standard PANGAEA benchmark. Overall, TerraMind v1-B outperforms all other GeoFMs by at least 3 pp in average mIoU. Importantly, TerraMind is the only foundation-model approach in EO that outperforms task-specific U-Net models across the PANGAEA benchmark.

Radar Plot

PANGAEA benchmark results for TerraMind and the top 5 EO FMs based on average rank. The mIoU is visualized on a min-max normalized scale, with the best performance displayed in parentheses.



Performance evaluation of TerraMind across nine benchmark datasets using the PANGAEA evaluation protocol. Higher mIoU (↑) and lower rank (↓) are better. The best model is highlighted and the second best is underlined.

Performance Table

💭 What is Thinking-in-Modalities?

During fine-tuning or inference, TerraMind can pause for a moment, imagine a helpful but absent layer, append the imagined tokens to its own input sequence, and then let the fine-tuned encoder continue, improving its own performance. Because the imagination lives in token space, the approach avoids the heavy diffusion decoding that full image synthesis would require. TerraMind can thus generate any missing modality as an intermediate step, an ability we call Thinking-in-Modalities (TiM).

TiM
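
Conceptually, a TiM step looks like the sketch below. Here `encoder` and `tim_generator` are placeholders standing in for the fine-tuned TerraMind encoder and its token generator, not a real API; the toy lambdas exist only so the sketch runs end to end.

```python
import torch

def thinking_in_modalities(encoder, tim_generator, x_tokens):
    """Conceptual TiM step; `encoder` and `tim_generator` are
    placeholders, not TerraMind's actual interfaces."""
    # 1. Imagine the helpful-but-absent modality directly in token
    #    space, e.g. land cover: no pixel-level diffusion decoding.
    imagined = tim_generator(x_tokens)             # (B, M) discrete tokens
    # 2. Append the imagined tokens to the model's own input sequence.
    extended = torch.cat([x_tokens, imagined], dim=1)
    # 3. Let the fine-tuned encoder continue on the enriched sequence.
    return encoder(extended)

# Toy stand-ins so the sketch is runnable.
tokens = torch.randint(0, 1000, (2, 196))
out = thinking_in_modalities(
    encoder=lambda t: t.float().mean(dim=1),       # dummy encoder
    tim_generator=lambda t: torch.randint(0, 1000, (t.shape[0], 196)),
    x_tokens=tokens,
)
```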

"TiM tuning boosts data efficiency by self-generating the additional training data relevant to the problem being addressed β€” for example, by telling the model to "think" about land cover when mapping water bodies. This breakthrough can unlock unprecedented accuracy when specializing TerraMind for particular use cases" said Johannes Jakubik, an IBM Research scientist based in Zurich.

TiM Results

⭐️ Exploring the embedding space

TerraMind is pretrained with a cross-modal patch classification objective. Empirical results suggest that this yields a well-structured latent space that clusters different concepts accurately. To investigate this hypothesis, we apply 1-Nearest-Neighbor (1-NN) classification without any weight updates. TerraMind significantly outperforms other models, pointing to a better-structured embedding space.

Few Shot

For one-shot classification, a labeled support set and unlabeled query data are mapped into an embedding space using the TerraMind encoder. The targets are classified based on the shortest distance to the labeled samples in the embedding space.
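
This procedure reduces to a few lines of PyTorch, sketched below; the embedding width and the random stand-in tensors are illustrative only, and in practice the embeddings would come from the TerraMind encoder.

```python
import torch

def one_shot_classify(support_emb, support_labels, query_emb):
    """1-NN classification in embedding space, with no weight updates.
    support_emb: (K, D) one embedding per labeled example,
    query_emb: (Q, D) unlabeled queries; returns (Q,) predicted labels."""
    # Pairwise Euclidean distances between queries and the support set.
    dists = torch.cdist(query_emb, support_emb)  # (Q, K)
    nearest = dists.argmin(dim=1)                # index of closest support
    return support_labels[nearest]

# 1-shot 5-way toy example with random stand-in embeddings.
support = torch.randn(5, 768)   # one embedding per class
labels = torch.arange(5)        # class ids 0..4
queries = torch.randn(20, 768)
preds = one_shot_classify(support, labels, queries)
```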



1-shot 5-way classification results using nearest neighbors, measured in accuracy and averaged over 200 runs. TerraMind outperforms benchmarks from CV and EO, suggesting a well-structured latent space.


Few Shot

πŸ›°οΈ Any-to-any generations

TerraMind is able to generate any modality from any other modality very efficiently. By using chained generations, the generated modalities are consistent with one another, as shown in the following figure. The generations can be applied to large tiles covering full landscapes, as shown in the examples below. All examples below required ten diffusion steps using TerraMind-B.
Any-to-any generations
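
A conceptual chained-generation loop might look as follows. Here `model.generate` and its arguments are placeholders rather than a documented API; the modality names mirror those used elsewhere on this page.

```python
def chained_generation(model, s2_input, chain=("S1GRD", "LULC", "NDVI"),
                       diffusion_steps=10):
    """Generate several modalities in sequence, conditioning each new
    modality on everything generated so far, which is what keeps the
    outputs consistent across modalities."""
    context = {"S2L2A": s2_input}
    for modality in chain:
        # Each step conditions on the original input AND all previously
        # generated modalities; `model.generate` is a placeholder.
        context[modality] = model.generate(
            inputs=context,
            target=modality,
            timesteps=diffusion_steps,  # e.g. ten diffusion steps
        )
    return context
```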

Large tile generation of Sentinel-1 RTC data using a Sentinel-2 L2A input from Singapore. Many features such as ships or airport runways are clearly visible in the S-1 RTC generations, while clouds are completely ignored.




Large tile generation of a Sentinel-1 GRD radar map using a Sentinel-2 L2A input from Santiago de Compostela.




Large tile generation of a land-use map using a Sentinel-2 L2A input from a bay near Santiago de Compostela.




TerraMind Blue-Sky Challenge


Submit Your Idea

A bi-monthly award spotlighting the boldest, most imaginative ways to push TerraMind beyond "just another fine-tune". Whether you're prototyping a new multi-modal workflow, exploring Thinking-in-Modalities, or inventing a never-seen geospatial application, we want you to share it with everyone.



πŸ“½οΈ Voices on TerraMind




"With Earth observation science, technology, and international collaboration, we are unlocking the full potential of space-based data to protect our planet" said Nicolas Longepe, Earth Observation Data Scientist at ESA. "This project is a perfect example where the scientific community, big tech companies, and experts have collaborated to leverage this technology for the benefit of Earth sciences. The magic happens when earth observation data experts, machine learning experts, data scientists, and HPC engineers come together."