Figure 3: The AI4Mars dataset [20] provides access to image captures of Mars' terrain with crowd-sourced annotations for four terrain classes: "regolith", "sand", "bedrock", and "large rock(s)"; terrain beyond 30 m is left unlabeled. (a) MSL NAVCAM image of Mars' landscape. (b) Crowd-sourced segmentation masks superimposed on the Martian landscape.

[...] needed for extraterrestrial applications. As a first step towards a space foundation model, we demonstrate the opportunity for FMs to mitigate data scarcity by synthetically augmenting extraterrestrial science datasets, such as AI4Mars. Specifically, we generate a multi-modal dataset comprising 150k QA tuples designed to emulate the detailed sensory reasoning required for tasks like identifying sites of scientific interest. We fine-tune an open-source Vision-Language Model (VLM) on our synthetic dataset, herein referred to as the Space-LLaVA dataset, and demonstrate the model's utility by providing language annotations on planetary observations and tasks withheld from training. Our evaluations demonstrate that: 1) existing VLMs are deficient visual reasoners in extraterrestrial applications; 2) instruction-tuning on our Space-LLaVA dataset endows a SoTA VLM with zero-shot performance gains on unseen extraterrestrial task types; 3) a small percentage, e.g., 20%, of the pre-training data is sufficient to safeguard against catastrophic forgetting; and 4) FMs can be effectively integrated into modular autonomy stacks to enable embodied high-level planning in space robotics.

2. RELATED WORK

Vision-Language Models: The advent of the Transformer [24] and derivative architectures, e.g., the Vision Transformer [25], has powered recent advances in natural language and image processing through VLMs trained on internet-scale text and image databases, e.g., Common Crawl and WebImageText [26]. Early work in vision-language modeling at scale [27] aligns latent representations of vision and language by training a vision encoder and a text encoder with a contrastive learning objective; a VLM builds on this architecture by adding a language model for open-ended visual reasoning such as VQA [28, 29, 23, 30]. In this work, we adapt LLaVA-v1.5-13B [2] to extraterrestrial robotics through fine-tuning, as this model is SoTA among open-source models on standard VQA benchmarks [18, 31].

Foundation Models in Robotics: Prior work has incorporated foundation models within the broader robot autonomy stack in various ways, ranging from planning [9], decision making [32], and semantic reasoning [7, 6] to visual reasoning [33]. However, the opportunity for foundation models in extraterrestrial robotics remains an emerging area of research. The Robot Operating System Agent [14] employs FMs to build a human- [...]

[...] applications, we develop a VQA generation pipeline based on the AI4Mars and MICD [42] datasets, supplemented by recent publications in astrophysics. Concretely, we translate AI4Mars' segmentation masks into visual context for GPT-assisted annotation of seven terrain-based, semantic tasks on Martian imagery, and, inspired by cosmosage [44], we introduce our own QA dataset reflecting scientific insights and facts captured by publications in arXiv's astrophysics category, e.g., Earth and Planetary Astrophysics, which we refer to as the SpaceScienceQA dataset.
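Each of these QA tuples is ultimately used to instruction-tune LLaVA-v1.5-13B, so a concrete way to picture the dataset is as records in LLaVA's conversation schema. The following minimal sketch assumes that schema; the identifier, image path, question, and answer below are illustrative placeholders rather than actual entries from the Space-LLaVA dataset.

import json

# A single synthetic QA tuple in the LLaVA-v1.5 instruction-tuning schema:
# an "id", an "image" path, and alternating human/gpt "conversations" turns.
# All values are hypothetical placeholders for illustration.
sample = {
    "id": "ai4mars_000123_terrain_description",
    "image": "ai4mars/msl/images/edr/NLB_000123.JPG",
    "conversations": [
        {"from": "human",
         "value": "<image>\nDescribe the landscape in view."},
        {"from": "gpt",
         "value": "Flat bedrock slabs dominate the scene, with patches of loose "
                  "sand in the foreground and no large rocks in view."},
    ],
}

# Serialize a list of such records to the JSON file consumed by fine-tuning.
with open("space_llava_sample.json", "w") as f:
    json.dump([sample], f, indent=2)

At fine-tuning time, the <image> token marks where the projected visual features are spliced into the language sequence.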
We first discuss our simple and scalable methodology to produce fine-grained sensory reasoning tasks on the AI4Mars and MICD datasets. Then, we detail our approach to synthetically generate high-quality science QA pairs for our SpaceScienceQA dataset. Our full dataset's composition, organized by prompt style and the designed fine-tuning tasks, is presented in Figure 5.

GPT-assisted Annotation: AI4Mars & MICD Datasets

(a) Terrain Description: GPT-4o annotates a candidate AI4Mars landscape with a description of the terrain in view. (b) Grain Characterization: GPT-4o annotates a candidate AI4Mars landscape by detailing the size and arrangement of particles.

We translate the high-quality segmentation masks afforded by the AI4Mars dataset, as shown in Figure 3b, into seven distinct semantic-reasoning tasks through GPT-assisted image annotation. These seven tasks, e.g., terrain comparison, are listed in full in Section A; they are designed to support Space-LLaVA as a tool for annotating planetary imagery, whose terrain-aware annotations may be used downstream by a specialized, task-specific ML algorithm. For each task, we design a total of ten questions that accomplish the same objective with varied prose, e.g., if the task is scene description, we may pose the question as 1) "describe the landscape in view." or 2) "what do you see in this image?", etc., so as to discourage over-fitting to a particular prompt's writing style during adaptation, i.e., fine-tuning. Before we query GPT-4o to perform, e.g., terrain comparison for a particular image, we first superimpose the appropriate terrain segmentation mask(s) on the original MSL NAVCAM image to color-code the landscape, as shown in Figure 6, creating visual context to support GPT-4o's analysis. Through the visual context and additional language context provided in the prompt, we request the desired annotation in a format that is readily discernible zero-shot by a SoTA VLM like GPT-4o, i.e., the requested annotation does not require prior, expert knowledge to answer the question. Importantly, all visual and language context is provided only to GPT-4o to promote high-quality data curation; this same context is withheld when training Space-LLaVA, as these features are not available at inference. Further details on the specific prompt used for data curation, e.g., the user and system message, are provided in Section A.
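To make this curation step concrete, the following minimal sketch color-codes an AI4Mars segmentation mask over its MSL NAVCAM frame and then requests a terrain description from GPT-4o. The class-id-to-color mapping, file paths, blending weight, and prompt wording are assumptions made for illustration; the exact system and user messages used for data curation are those described in Section A, not the ones shown here.

import base64, io
import numpy as np
from PIL import Image
from openai import OpenAI

# Hypothetical per-pixel class ids for the four AI4Mars terrain classes;
# the colors are arbitrary choices used only to build visual context.
CLASS_COLORS = {0: (200, 160, 60),   # regolith
                1: (230, 220, 120),  # sand
                2: (120, 120, 200),  # bedrock
                3: (200, 60, 60)}    # large rock(s)

def color_code(navcam_path: str, mask_path: str, alpha: float = 0.4) -> Image.Image:
    """Superimpose a color-coded terrain mask on the NAVCAM frame."""
    image = np.array(Image.open(navcam_path).convert("RGB"), dtype=np.float32)
    mask = np.array(Image.open(mask_path))
    overlay = image.copy()
    for class_id, color in CLASS_COLORS.items():
        overlay[mask == class_id] = color          # paint labeled pixels
    blended = (1 - alpha) * image + alpha * overlay
    return Image.fromarray(blended.astype(np.uint8))

def request_annotation(frame: Image.Image, question: str) -> str:
    """Ask GPT-4o for a terrain annotation grounded in the color-coded frame."""
    buffer = io.BytesIO()
    frame.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode()
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Terrain classes are color-coded: regolith, sand, "
                        "bedrock, large rock(s). Answer using the visual context."},
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content

# Example: one of the ten phrasings for the terrain-description task.
# annotation = request_annotation(color_code("frame.png", "mask.png"),
#                                 "Describe the landscape in view.")

Only the color-coded frame and the prompt are sent to GPT-4o; as noted above, Space-LLaVA itself is trained on the original, unannotated NAVCAM image, since the mask is unavailable at inference.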
Then, with the MICD dataset, we have the inverse problem: [...]

GPT-assisted Annotation: SpaceScienceQA Dataset

[...]
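As introduced earlier, the SpaceScienceQA dataset distills scientific insights and facts from arXiv astrophysics publications into QA pairs. The sketch below illustrates one way such a pair could be generated with GPT-4o from a passage of a paper; the prompt wording, two-line output convention, and parsing are illustrative assumptions, not the exact SpaceScienceQA curation procedure.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_science_qa(passage: str) -> dict:
    """Distill one question/answer pair from a passage of an astrophysics paper."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write self-contained question/answer pairs that test "
                        "understanding of scientific facts stated in the passage. "
                        "Reply with exactly two lines: 'Q: ...' then 'A: ...'."},
            {"role": "user", "content": passage},
        ],
    )
    # Parse the assumed two-line "Q:/A:" convention into a QA record.
    lines = response.choices[0].message.content.strip().splitlines()
    return {"question": lines[0].removeprefix("Q:").strip(),
            "answer": lines[1].removeprefix("A:").strip()}

# Example usage on a (hypothetical) abstract from arXiv's Earth and
# Planetary Astrophysics listing:
# qa = generate_science_qa(open("astro_ph_ep_abstract.txt").read())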