
1. Introduction
• Problem Statement: Traditional, vision-only AI models in remote sensing (RS) lack the
semantic understanding needed for complex, language-driven tasks.
• Solution: Vision-language models (VLMs) bridge visual and textual reasoning,
enabling tasks like image captioning, visual question answering (VQA), and cross-
modal retrieval.
• Scope: This summary explores VLM architectures, applications in RS, challenges,
and future directions.
2. Evolution of Vision-Language Models
2.1 From Vision-Centric to Multimodal Models
• Vision-Centric Models: CNNs and vision transformers (e.g., ResNet, ViT)
dominate feature extraction but lack textual reasoning.
• Large Language Models (LLMs): GPT, BERT, and T5 excel in text but lack visual
grounding.
• VLMs: Integrate vision and language through two primary architectures:
o Fusion Encoders (e.g., VisualBERT, ViLBERT): Merge visual-textual features
via cross-modal attention.
o Dual Encoders (e.g., CLIP, ALIGN): Align image-text embeddings in a shared
latent space for efficient retrieval (see the contrastive-alignment sketch after this list).
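The dual-encoder idea can be made concrete with a short PyTorch sketch: two separate encoders project image and text features into one shared space, and a symmetric contrastive loss pulls matching image-caption pairs together. The backbones, dimensions, and temperature initialisation below are illustrative assumptions, not the actual CLIP or ALIGN implementation.

```python
# Minimal dual-encoder sketch (CLIP-style): project image and text features
# into a shared embedding space and train with a symmetric contrastive loss.
# Random features stand in for real CNN/ViT and text-transformer backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # image-side projection head
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # text-side projection head
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature (log 1/0.07)

    def forward(self, img_feats, txt_feats):
        # L2-normalise so the dot product below is a cosine similarity
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()  # [batch, batch] similarity matrix

def contrastive_loss(logits):
    # Symmetric InfoNCE: the i-th image should match the i-th caption, and vice versa
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of 8 RS image-caption pairs, using random backbone features
model = DualEncoder()
logits = model(torch.randn(8, 2048), torch.randn(8, 768))
print(contrastive_loss(logits))
```

At inference, cross-modal retrieval simply ranks captions (or images) by the same cosine similarity, which is why dual encoders scale well to large RS archives.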
2.2 Foundation Models for RS
• Definition: Large, task-agnostic pre-trained models that are adapted for domain-specific
feature extraction.
o Supervised: models pre-trained on MillionAID (~1M labeled RS images).
o Self-Supervised: RingMo (masked image modeling; supports change detection and
segmentation).
• LLM-Integrated VLMs (e.g., RSGPT, GeoChat): Combine frozen LLMs (e.g., GPT) with
vision encoders for captioning and VQA (a minimal sketch follows this list).
• Generative VLMs: Use GANs/diffusion models (e.g., Txt2Img-MHN) for synthetic
RS data generation.
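The frozen-LLM recipe used by models such as RSGPT and GeoChat can be sketched as a visual-prefix adapter: image features are projected into the LLM's token-embedding space and prepended to the prompt, and only that projection is trained. Everything below (module names, dimensions) is an assumed, minimal illustration rather than code from those projects.

```python
# Assumed sketch of the frozen-LLM + vision-encoder pattern (RSGPT/GeoChat-style):
# visual features become a "prefix" of pseudo-tokens in the LLM's embedding space,
# and only the projection layer is trainable. vision_encoder and llm are
# placeholder modules, not the actual components of those models.
import torch
import torch.nn as nn

class VisualPrefixVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT, kept frozen
        self.llm = llm                        # e.g. a decoder-only LLM, kept frozen
        for module in (self.vision_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False
        # Only this projection ("adapter") is updated when adapting to RS data
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image, text_embeds):
        vis_feats = self.vision_encoder(image)             # [B, n_vis_tokens, vis_dim]
        vis_tokens = self.proj(vis_feats)                   # map into the LLM embedding space
        prefix_and_text = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(prefix_and_text)                    # frozen LLM produces caption/answer
```

Freezing both backbones keeps adaptation cheap and preserves the LLM's language ability; the lightweight projection is all that learns the RS-specific visual-to-text mapping.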
3. Applications in Remote Sensing
3.1 Cross-Modal Understanding
• Image Captioning: Generate natural language descriptions of RS imagery (e.g.,
RSGPT with Q-Former alignment).
• Visual Question Answering (VQA): Answer queries about RS images using
transformer-based fusion (e.g., CLIP + BERT; see the fusion sketch after this list).
• Visual Grounding: Link textual queries to spatial regions (e.g., bounding box
detection via language-image encoders).
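To make transformer-based fusion concrete, the following is a minimal, assumed VQA head: a global image embedding (e.g., from a CLIP image encoder) and question token embeddings (e.g., from BERT) are projected to a common width, jointly attended by a small transformer encoder, and classified over a fixed answer vocabulary. Dimensions, layer counts, and the answer set are placeholders, not a specific published RS VQA model.

```python
# Assumed VQA fusion head: a global image embedding and question token embeddings
# are projected to a common width, fused by a small transformer encoder, and
# classified over a fixed answer vocabulary. All sizes are placeholders.
import torch
import torch.nn as nn

class RSVQAHead(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=512, n_answers=100):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # e.g. CLIP image embedding -> hidden
        self.txt_proj = nn.Linear(txt_dim, hidden)   # e.g. BERT token embeddings -> hidden
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feats, txt_feats):
        img = self.img_proj(img_feats).unsqueeze(1)        # [B, 1, hidden] image token
        txt = self.txt_proj(txt_feats)                      # [B, T, hidden] question tokens
        fused = self.fusion(torch.cat([img, txt], dim=1))   # joint self-attention over both modalities
        return self.classifier(fused[:, 0])                 # answer logits read from the image token

head = RSVQAHead()
logits = head(torch.randn(2, 512), torch.randn(2, 16, 768))  # 2 image-question pairs
print(logits.shape)  # torch.Size([2, 100])
```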