NLP progress inspires IBM and NASA
First, NASA hopes to speed up the processing of two types of data. On the one hand, it intends to improve information retrieval from research articles written by its teams and the scientific community. On the other hand, it intends to use space-based remote sensing data of the Earth to observe and predict the evolution of the climate, as well as to develop applications useful to emergency responders.
“Foundation models are part of a larger initiative within IBM Research,” said Priya Nagpurkar, vice president of Hybrid Cloud Platform and Developer Productivity at IBM Research.
“This emerging technology, which can take in large amounts of unlabeled data, ‘learn’ to perform a task in one domain, and transfer that knowledge to another domain, greatly reduces the effort to create artificial intelligence,” she continued. “It also eliminates the need to tag large volumes of data.”
GPT-3, the popular NLG model that spawned ChatGPT, is a typical example of this model category. “At IBM Research, we believe it’s time to take these advances and apply them to a variety of techniques and areas that are particularly important to IBM’s businesses and customers, as well as to the advancement of science,” said Priya Nagpurkar.
The equivalent of ChatGPT for researchers
NASA and IBM Research have already begun working on the research papers. The two institutions are developing a baseline model based on 300,000 articles published in scientific journals, including those of the American Geophysical Union (AGU) and the American Meteorological Society (AMS). The NLP model in question is fine-tuned. “We train our model on about one-tenth the amount of data used to train a model like GPT-3 because we focus on geoscience knowledge,” says Raghu Ganti, a researcher at IBM Research.
It will then be integrated with the open source PrimeQA toolkit to ask natural language questions and get answers. The promise is to provide a sourced summary of the latest research on a topic related to Earth observation.
The benchmarking tool included in PrimeQA has already shown that the model jointly developed by IBM Research and NASA performs better on training data than BERT and RoBERTa, two NLP models built on “transformers”, developed and improved by Google and Meta respectively. NASA “hopes to begin using this model in the middle of this year,” said Rahul Ramachandran, principal investigator at NASA’s Marshall Space Flight Center.
“We’re also thinking about how the model can be used to improve information and data discovery,” says the NASA researcher. “Because embeddings contain an understanding of context, you can use this to improve your search results. Another big potential opportunity for us is creating meta documents, keyword annotation, etc. to enhance some of our data management activities,” he adds.
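To make the idea of embedding-based search concrete, here is a minimal sketch in Python. It is not the NASA and IBM system: the three-dimensional vectors, the document titles and the function names are hypothetical placeholders, whereas a real system would obtain its embeddings from a trained language model. It only shows the general principle of ranking documents by how close their vectors are to the vector of a query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (title, score) pairs sorted from most to least similar to the query."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings: real ones would come from a trained model
# and have hundreds of dimensions, not three.
TOY_DOCS = {
    "flood mapping with satellite imagery": [0.9, 0.1, 0.0],
    "ocean surface temperature trends": [0.1, 0.9, 0.1],
    "biomass estimation in forests": [0.2, 0.1, 0.9],
}
TOY_QUERY = [0.8, 0.2, 0.1]  # stands in for the embedding of a flood-related question

if __name__ == "__main__":
    for title, score in rank_documents(TOY_QUERY, TOY_DOCS):
        print(f"{score:.3f}  {title}")
```

With these toy values, the flood-related document ranks first because its vector points in nearly the same direction as the query vector, even though the two share no keywords in this sketch.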
Observing climate change with artificial intelligence is quite another matter
Processing of terrestrial remote sensing data is still in its infancy.
“You may not know this, but NASA has the largest collection of Earth observation data,” says Rahul Ramachandran. “Our data comes from a variety of instruments and the collection covers all sub-disciplines of the geosciences. Currently, we have 70 petabytes of data in our archive, and by 2025 this number is expected to reach 250 petabytes.”
What is the reason for this increase? The US space agency plans to map the oceans and large water bodies starting this year.
“As part of the workshop in 2020, we learned how to combine artificial intelligence and machine learning in Earth observation,” said Rahul Ramachandran. “Two problems emerged: the lack of large training data sets for these models, and their failure to generalize their training over time and space. Foundation models have the potential to solve both of these problems.”
IBM and NASA have already begun experimenting with the Harmonized Landsat Sentinel (HLS) database, which contains images collected and cleaned from the Landsat and Sentinel-2 satellites. Next, the two partners will look at the data set of the MERRA-2 project, which is devoted to meteorological reanalysis of atmospheric observation data recorded since 1980.
“We are trying to develop a foundation model that can be used to build a variety of applications: measuring landscape evolution, biomass estimation, flood detection, etc.,” explains Raghu Ganti. “We want to develop a single model that covers multiple regions and multiple time periods.”
These datasets combine, among other things, time series data, images, meteorological measurements and descriptions of the atmosphere. NASA’s Rahul Ramachandran believes that “processing large amounts of scientific data with different attributes, including spatial and temporal dimensions, poses significant algorithmic challenges.”
“Transformers trained on texts will have to evolve to be trained on such data,” confirms Raghu Ganti. “But it’s something we’re actively looking into.”
IBM says that the architecture for developing these models is based on the Red Hat OpenShift platform deployed in the AWS cloud. The NLP model, developed using the PyTorch and Ray frameworks, will be one of the “largest AI workloads running on OpenShift,” says Raghu Ganti. Because NASA’s researchers and IBM optimized the processing steps and did not need very large amounts of data, the training phase of the NASA model will take 6 hours on a cluster of 32 GPUs. “Our model’s database contains about 1 billion tokens, compared to about 50 billion for large open-source NLP models,” explains the IBM researcher.
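As a quick back-of-the-envelope check on the figures quoted above, the short Python sketch below computes the total compute budget in GPU-hours and the size of the training corpus relative to the larger models mentioned. The constants are taken from the article; the function and variable names are illustrative only, and the calculation says nothing about the type of GPU used, which is not specified.

```python
# Figures quoted in the article.
TRAINING_HOURS = 6                     # wall-clock duration of the training phase
GPU_COUNT = 32                         # size of the GPU cluster
MODEL_TOKENS = 1_000_000_000           # about 1 billion tokens for the NASA model
LARGE_MODEL_TOKENS = 50_000_000_000    # about 50 billion for large open-source NLP models

def gpu_hours(hours, gpus):
    """Total compute budget, assuming every GPU is busy for the whole run."""
    return hours * gpus

def corpus_ratio(small, large):
    """Size of the smaller corpus as a fraction of the larger one."""
    return small / large

if __name__ == "__main__":
    print(f"Compute budget: {gpu_hours(TRAINING_HOURS, GPU_COUNT)} GPU-hours")
    print(f"Corpus size: {corpus_ratio(MODEL_TOKENS, LARGE_MODEL_TOKENS):.0%} of the larger corpus")
```

In other words, the quoted run amounts to 192 GPU-hours, on a corpus roughly one-fiftieth the size of the larger ones cited.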
Building a deep learning model on remote sensing data from NASA satellites and instruments is another challenge altogether.
“I think this partnership will help drive innovation in each of these areas, from infrastructure to infrastructure, not to mention advances in architecture and even data management techniques,” said Priya Nagpurkar.
A team dedicated to the project at IBM Research is already working with meteorologists from The Weather Company (the group’s weather subsidiary) to design applications and use cases that will be validated by NASA.