Review: Visual Prompt Engineering in Foundational
Models for AGI Applications
1. Introduction:
o Focus: Traces the evolution of foundational AI models (e.g., Transformer, GPT, ViT)
and their role in advancing visual prompt engineering.
o Key Argument: Prompt engineering bridges pre-trained models and downstream tasks,
enabling zero-shot generalization. Highlights the shift from NLP to CV, emphasizing
the need for adaptive interfaces like SAM and CLIP.
2. Background Knowledge:
o Defines prompts in NLP (cloze/prefix prompts) and CV (SAM, CLIP). Highlights
Transformer’s dominance in multi-modal tasks.
o 2.1 Prompts in NLP:
§ Defines cloze prompts (mid-text) and prefix prompts (end-of-text).
§ Discusses manual vs. automated prompt design (e.g., discrete/continuous
prompts like Prefix-Tuning[65]).
§ Limitations: Manual prompts are task-specific and labor-intensive;
automated methods risk overfitting.
o 2.2 Foundation Models:
§ Transformer: Basis for ViT, enabling multi-modal tokenization (text,
images).
§ CLIP: Aligns text-image embeddings via contrastive learning on 400M
pairs.
§ VPT: Learns task-specific prompts in input space without modifying model
parameters.
§ SAM: Universal segmentation model using prompts (points, boxes) for
zero-shot generalization.
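To make CLIP's prompt interface concrete, below is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name, image path, and label set are illustrative placeholders, not choices from the reviewed paper.

```python
# Minimal zero-shot classification sketch with CLIP (Hugging Face transformers).
# The checkpoint name, image path, and candidate labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image
labels = ["cat", "dog", "airplane"]
# The fixed template "a photo of a {}" is exactly the kind of manual prompt
# the review discusses; performance is sensitive to this wording.
texts = [f"a photo of a {c}" for c in labels]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(labels, probs[0].tolist())))
```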
3. Visual Prompts Learning:
o Discusses methodologies: multi-modal prompts (CLIP, CoOP), visual tuning
(VPT), and hybrid systems (DenseCLIP, MaPLe).
3.1 Multi-Modal Models and Prompts:
o CLIP[49]: Uses fixed text prompts (e.g., "a photo of [class]") but is highly
sensitive to prompt wording.
o CoOP[117]: Trains continuous prompts for flexibility but overfits to downstream
tasks.
o DenseCLIP[118]: Transfers CLIP to dense tasks via pixel-text matching.
o MaPLe[119]: Jointly optimizes text-image prompts for cross-modal alignment.
Key Trend: Shift from manual to dynamic, context-aware prompting.
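As a sketch of the continuous-prompt idea behind CoOP, the snippet below learns a bank of context vectors that are prepended to frozen class-name embeddings; all shapes and initializations are illustrative stand-ins rather than CoOP's exact recipe.

```python
# CoOP-style continuous prompts: learn context vectors in embedding space
# while the backbone stays frozen. Shapes and names are illustrative.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_classes=10):
        super().__init__()
        # Learned "words": n_ctx context vectors shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name embeddings would come from CLIP's token
        # embedder; random tensors stand in for them here.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        # Prepend the shared context to every class embedding:
        # output shape (n_classes, n_ctx + 1, dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)

prompts = LearnableContext()
# Only the context vectors receive gradients during downstream tuning.
optimizer = torch.optim.AdamW([prompts.ctx], lr=2e-3)
print(prompts().shape)  # torch.Size([10, 17, 512])
```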
3.2 Visual Prompts:
o VPT[95]: Adds learnable parameters to inputs for efficient tuning of frozen models.
o AdaptFormer[137]: Integrates lightweight modules into ViT for action
recognition.
o ConvPass[138]: Tailors pre-trained ViTs with convolutional bypasses to
reduce computational expense.
o ViPT: Addresses the challenge of limited large-scale training data in
downstream multi-modal tracking tasks.
o DAM-VP[140]: Uses meta-prompts to handle distribution shifts across domains.
Limitation: Balancing computational efficiency with adaptability (e.g.,
ConvPass[138]).
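The following minimal sketch illustrates the VPT pattern named above: learnable prompt tokens are prepended to the patch-token sequence of a frozen transformer, and only the prompts and a task head are trained. The encoder here is a generic stand-in for a pre-trained ViT, not VPT's exact architecture.

```python
# VPT-style visual prompt tuning: prepend learnable prompt tokens to the
# patch-token sequence of a frozen transformer. The encoder is a generic
# stand-in for a pre-trained ViT block stack.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, encoder, embed_dim=768, n_prompts=10, n_classes=100):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, embed_dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(embed_dim, n_classes)  # task head is trained

    def forward(self, patch_tokens):  # (B, N, D) patch embeddings
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), 2)
model = PromptedViT(encoder)
logits = model(torch.randn(4, 196, 768))  # 4 images, 14x14 patches
print(logits.shape)  # torch.Size([4, 100])
```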
4. Visual Prompts in AGI:
o Applications in object detection (SAM’s crater segmentation), multi-modal fusion
(Text2Seg), and model combinations (Inpaint Anything).
4.1 Object Detection:
o Object Counting: Researchers adopt SAM by employing bounding boxes as
prompts to generate segmentation masks (see the sketch at the end of this
subsection).
o Zero-Shot Segmentation: SAM applied in medical imaging (e.g., tumor
delineation [54]) and remote sensing (rotated bounding boxes [147]).
Limitations: Struggles with low-contrast or ambiguous semantics.
o SAM-Adapter[149]: Infuses domain knowledge (e.g., medical scans) into SAM.
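A minimal sketch of the box-prompted workflow described above, using Meta's segment-anything package; the checkpoint path, image file, and box coordinates are placeholders.

```python
# Box-prompted zero-shot segmentation with SAM (segment-anything package).
# Checkpoint path, image file, and box coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # compute the image embedding once

# One axis-aligned box prompt per object instance (x1, y1, x2, y2).
boxes = [np.array([120, 80, 310, 260]), np.array([400, 50, 520, 180])]
masks = []
for box in boxes:
    m, scores, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])

print(f"segmented {len(masks)} instances")  # e.g., for object counting
```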
4.2 Multi-modal Fusion:
o Text2Seg[147]: Combines text prompts (CLIP) with SAM for segmentation.
o SAMText: Generates segmentation masks for scene text in images or video
frames.
o Caption Anything[150]: Framework integrating SAM and ChatGPT for interactive
image captioning.
o SAA+[151]: Uses hybrid prompts for anomaly detection in industrial settings.
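Text2Seg's full pipeline is more elaborate, but the text-to-mask idea can be sketched by scoring candidate boxes with CLIP and handing the best match to SAM as a box prompt; the proposal boxes, file paths, and checkpoints below are assumptions for illustration, not the published method.

```python
# Text-guided segmentation sketch: rank candidate boxes with CLIP, then
# pass the best-matching box to SAM. Illustrative pipeline in the spirit
# of Text2Seg, not its exact implementation.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from segment_anything import sam_model_registry, SamPredictor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def best_box_for(text, image, boxes):
    """Score each cropped box against the text prompt; return the winner."""
    crops = [image.crop(tuple(b)) for b in boxes]
    inputs = proc(text=[text], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_text[0]  # one score per crop
    return boxes[int(sims.argmax())]

image = Image.open("scene.jpg").convert("RGB")
boxes = [np.array([0, 0, 200, 200]), np.array([200, 0, 400, 200])]  # proposals
box = best_box_for("a red car", image, boxes)
predictor.set_image(np.array(image))
mask, _, _ = predictor.predict(box=box, multimask_output=False)
```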
4.3 Combination of Models:
o Inpaint Anything[115]: Merges SAM, LaMa, and Stable Diffusion for image
editing.
o Edit Everything[152]: Edits images via SAM + CLIP + Stable Diffusion.
o SAM-Track[104]: Enables video object tracking with multi-modal prompts.
o Explain Any Concept: Employs SAM for initial segmentation and introduces
a surrogate model for efficient explanation.
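A minimal sketch of the remove-and-fill composition behind systems like Inpaint Anything, assuming a SAM mask has already been saved to disk and using the diffusers inpainting pipeline; the checkpoint name, file paths, and text prompt are placeholders.

```python
# "Inpaint Anything"-style composition sketch: a SAM-produced mask selects
# the region, and a diffusion inpainting model fills it from a text prompt.
# Checkpoint names and file paths are illustrative placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")  # assumes a GPU is available

image = Image.open("scene.jpg").convert("RGB").resize((512, 512))
# White pixels mark the object selected by a click-prompted SAM mask.
mask = Image.open("sam_mask.png").convert("L").resize((512, 512))

result = pipe(prompt="empty park bench", image=image, mask_image=mask).images[0]
result.save("edited.jpg")
```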
5. Future Directions:
o Advocates for adaptive tuning (reinforcement learning, adapters) and addresses
open challenges (interactive environments, semantic sparsity).
5.1 Adaptation of LVMs:
o Key Techniques: Prompt fine-tuning, reinforcement learning, adapter modules,
and knowledge distillation.
o Goal: Improve efficiency (e.g., lightweight modules) and generalization (e.g.,
DAM-VP).
§ 1. Key Techniques
(a) Prompt Fine-Tuning
§ Definition: Adjusting input prompts (e.g., text, bounding boxes,
points) to guide LVMs toward domain-specific tasks without
retraining the entire model.
§ Remote Sensing Application:
§ Task-Specific Prompts: Use prompts like "segment
rectangular agricultural fields" or "detect irregularly
shaped water bodies" to improve SAM’s zero-shot
segmentation.
§ Multi-Modal Prompts: Combine text prompts (e.g., "urban
infrastructure") with visual prompts (e.g., rotated bounding
boxes for slanted rooftops).
§ Example: In SAM, rotated bounding boxes (R-Boxes) are used to
align with objects in top-down satellite imagery (Page 14).
(b) Reinforcement Learning (RL)
§ Definition: Training models to iteratively refine prompts based on
feedback (e.g., reward signals for accurate segmentation).
§ Remote Sensing Application:
§ Dynamic Prompt Adjustment: Train SAM to optimize
prompts for detecting occluded objects (e.g., vehicles under
tree cover) by rewarding high IoU (Intersection over Union)
scores.
§ Adaptive Exploration: Use RL to balance exploration
(trying new prompts) and exploitation (using effective
prompts) in diverse environments (e.g., deserts vs. forests).
§ Challenge: Designing reward functions that account for remote
sensing complexities (e.g., shadows, seasonal changes).
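The review does not fix a specific RL algorithm; as one hedged illustration, a simple bandit-style search that rewards candidate point prompts by the IoU of the resulting mask could look like this, where predict_mask and gt_mask are stand-ins for a SAM predictor and a reference annotation.

```python
# Bandit-style prompt search sketch: sample candidate point prompts, reward
# each by the IoU of the mask it produces, and keep the best performer.
# `predict_mask` and `gt_mask` stand in for a SAM predictor and a reference
# annotation; neither comes from the reviewed paper.
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def search_prompt(predict_mask, gt_mask, h, w, n_trials=50, seed=0):
    rng = np.random.default_rng(seed)
    best_point, best_reward = None, -1.0
    for _ in range(n_trials):
        point = rng.integers(0, (w, h))  # candidate (x, y) prompt
        reward = iou(predict_mask(point), gt_mask)  # feedback signal
        if reward > best_reward:
            best_point, best_reward = point, reward
    return best_point, best_reward
```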
(c) Adapter Modules
§ Definition: Adding lightweight, task-specific layers to pre-trained
LVMs to adapt them to new domains.
§ Remote Sensing Application:
§ SAM-Adapter [149]: Inject domain knowledge (e.g.,
geospatial metadata) into SAM via small neural modules.
For example, adapters trained on desert imagery improve
segmentation of sand dunes.
§ Efficiency: Adapters add <1% parameters to SAM, enabling
deployment on satellites/drones with limited compute.
§ Example: A desert-specific adapter could reduce false positives in
arid regions by filtering out vegetation-like textures.
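SAM-Adapter's internals differ in detail, but the generic bottleneck-adapter pattern these bullets describe can be sketched as a small residual module attached to a frozen layer; the dimensions are illustrative.

```python
# Generic bottleneck adapter sketch: a small residual MLP attached to a
# frozen backbone layer so that <1% of parameters are trained per domain.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to a tiny space
        self.up = nn.Linear(bottleneck, dim)    # project back
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

# Wrap a frozen layer: only the adapter's ~50k parameters are trainable.
block = nn.Linear(768, 768)
for p in block.parameters():
    p.requires_grad = False
adapted = nn.Sequential(block, Adapter())
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))
```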
(d) Knowledge Distillation
§ Definition: Transferring knowledge from large models (e.g., SAM)
to smaller, efficient models.
§ Remote Sensing Application:
§ Lightweight Models: Distill SAM’s segmentation
capabilities into a smaller ViT model for real-time crop
monitoring on drones.
§ Edge Deployment: Compress SAM for use in low-power
devices (e.g., CubeSats) by mimicking its mask prediction
behavior.
§ Example: A distilled SAM variant achieves 90% of the original
model’s accuracy on road detection tasks but runs 5x faster.
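As a sketch of the distillation objective, a student network can be trained to mimic a frozen teacher's mask logits on unlabeled imagery; the teacher, student, and data batch below are stand-ins, and the MSE objective is one common choice rather than a prescription from the paper.

```python
# Knowledge-distillation sketch: train a small student to mimic a frozen
# teacher's mask logits. `teacher`, `student`, `images`, and `optimizer`
# are stand-ins for real models and a real data loader.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)        # teacher mask logits, e.g. (B, 1, H, W)
    pred = student(images)              # student predictions, same shape
    loss = F.mse_loss(pred, target)     # match the teacher's behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```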
2. Goals
(a) Improve Efficiency
§ Lightweight Modules:
§ Use ConvPass [138] to add convolutional bypasses to ViT,
reducing SAM’s compute load for processing high-resolution
satellite imagery.
§ Deploy quantized SAM on edge devices (e.g., drones) for
real-time deforestation monitoring.
§ Selective Prompting: Only activate resource-intensive model
components when needed (e.g., SAM’s mask decoder for complex
objects).
(b) Enhance Generalization
§ DAM-VP (Diversity-Aware Meta Visual Prompting) [140]:
§ Clustering: Group remote sensing data into subsets (e.g.,
urban, agricultural, coastal) and train specialized prompts for
each.
§ Meta-Learning: Use DAM-VP to quickly adapt prompts to
new regions (e.g., transferring forest segmentation prompts
to mangrove detection).
§ Cross-Domain Tuning: Pre-train SAM on multi-modal remote
sensing datasets (e.g., RGB + infrared) to handle sensor variations.
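DAM-VP's full meta-learning loop is beyond a snippet, but its diversity-aware core, clustering a dataset and keeping one prompt per cluster, can be sketched as follows; the random features stand in for real backbone embeddings.

```python
# Diversity-aware prompting sketch: cluster image features and keep one
# learnable prompt per cluster, in the spirit of DAM-VP. The features
# here are random placeholders for real backbone embeddings.
import numpy as np
import torch
from sklearn.cluster import KMeans

features = np.random.randn(1000, 512)   # embeddings of a training dataset
kmeans = KMeans(n_clusters=8, n_init=10).fit(features)

# One prompt bank entry per cluster (e.g., urban/agricultural/coastal).
prompts = torch.nn.Parameter(torch.zeros(8, 10, 768))

def prompt_for(feature):
    """Route a new image to its cluster's prompt."""
    cluster = int(kmeans.predict(feature.reshape(1, -1))[0])
    return prompts[cluster]

print(prompt_for(features[0]).shape)  # torch.Size([10, 768])
```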
3. Case Study: SAM in Remote Sensing
§ Problem: SAM struggles with arbitrarily oriented objects (e.g.,
ships, slanted buildings) in satellite imagery.
§ Solution:
1. Rotated Bounding Box Prompts: Use R-Boxes to guide
SAM’s segmentation (Page 14).
2. Adapter Fine-Tuning: Train an adapter on maritime
datasets to improve ship detection.
3. Knowledge Distillation: Create a lightweight SAM variant
for deployment on low-orbit satellites.
§ Result: SAM achieves 85% accuracy in segmenting ships with
rotated prompts, compared to 62% with default horizontal boxes.
4. Challenges & Future Work
§ Data Scarcity: Limited labeled remote sensing datasets for niche
tasks (e.g., glacier monitoring).
§ Solution: Use synthetic data generated by diffusion models
(e.g., Stable Diffusion) to augment training.
§ Real-Time Processing: High-resolution satellite imagery demands
efficient computation.
§ Solution: Hybrid architectures (e.g., SAM + lightweight
CNNs) for parallel processing.
§ Ethical Risks: Biases in prompts (e.g., misclassifying informal
settlements as "non-residential").
§ Solution: Fairness audits and inclusive prompt design (e.g.,
community-driven labeling).
5.2 Challenges for Visual AGI:
o Gaps: Lack of interactive environments (vs. NLP), semantic sparsity in images,
domain shifts.
o Proposal: Unified frameworks inspired by NLP (e.g., generative pre-training with
multimodal fusion).
§ 1. Lack of Interactive Environments
§ Challenge:
Unlike NLP systems (e.g., ChatGPT), which thrive on iterative user
interactions (e.g., refining prompts via dialogue), remote sensing
models lack frameworks for dynamic user engagement. For
example:
§ A user analyzing satellite imagery cannot iteratively guide a
model to refine segmentation masks of "urban sprawl" or
"deforested areas" in real time.
§ Current models like SAM operate statically, producing
outputs without adapting to user feedback.
§ Proposal:
§ Interactive Prompting: Develop systems where users can
click, draw, or describe regions of interest (e.g., "Segment
all oil tanks in this SAR image") to iteratively refine results.
§ Feedback Loops: Integrate reinforcement learning (RL) to
let models learn from user corrections (e.g., rewarding
accurate detection of ships in cloudy imagery).
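A minimal sketch of the interactive-prompting proposal using the segment-anything predictor API: user clicks accumulate as point prompts, and only SAM's lightweight mask decoder re-runs per correction. The click handler and blank image are stand-ins for a real UI and real imagery.

```python
# Interactive refinement sketch with SAM: each user click (positive or
# negative) is appended to the point-prompt list and only the mask decoder
# re-runs; the expensive image embedding is computed once.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder imagery
predictor.set_image(image)  # expensive encoder pass, done once

def get_click():
    # Stand-in for a real UI handler: returns ((x, y), is_foreground_click).
    return (np.random.randint(0, 512, size=2), True)

points, labels = [], []  # labels: 1 = foreground, 0 = background
for _ in range(5):       # five rounds of user correction
    xy, is_fg = get_click()
    points.append(xy)
    labels.append(1 if is_fg else 0)
    mask, _, _ = predictor.predict(
        point_coords=np.array(points),
        point_labels=np.array(labels),
        multimask_output=False,  # single refined mask per round
    )
    print("mask pixels:", int(mask[0].sum()))  # stand-in for display
```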
2. Semantic Sparsity in Images
§ Challenge:
Remote sensing images often exhibit sparse, scattered objects
against vast backgrounds (e.g., ships in oceans, vehicles in deserts).
Key issues include:
§ Low Object Density: Most pixels lack meaningful
information (e.g., empty agricultural fields).
§ Scale Variability: Objects range from small (e.g., cars) to
large (e.g., wind farms).
§ Ambiguity: Similar spectral signatures for different objects
(e.g., water bodies vs. shadows).
§ Proposal:
§ Generative Pre-training: Train models on diverse datasets
(e.g., combining Sentinel-2, Landsat, and UAV imagery) to
learn robust feature representations of sparse objects.
§ Attention Mechanisms: Use transformer-based
architectures (e.g., Vision Transformers) to focus
computational resources on regions of interest.
§ Multimodal Fusion: Integrate ancillary data (e.g., elevation
models, weather data) to resolve ambiguities (e.g.,
distinguishing lakes from shadows using terrain height).
3. Domain Shifts
§ Challenge:
Remote sensing models struggle with variations across:
§ Sensors: Optical and SAR imagery have drastically different
characteristics (e.g., SAR highlights texture; optical captures
color).
§ Geographies: A model trained on European urban areas
may fail on African rural landscapes.
§ Temporal Changes: Seasonal variations (e.g., snow cover
vs. summer vegetation) alter object appearances.
§ Proposal:
§ Unified Frameworks:
§ Cross-Sensor Alignment: Use contrastive learning
to map features from different sensors (e.g., optical
+ SAR) into a shared embedding space.
§ Domain-Invariant Prompts: Design prompts that
generalize across regions (e.g., "Segment buildings"
instead of "Segment European-style buildings").
§ Continual Learning: Enable models to adapt to new
geographies or seasons without forgetting previous
knowledge.
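The cross-sensor alignment proposal can be sketched as a standard symmetric InfoNCE-style contrastive loss over paired optical/SAR embeddings; the encoders are omitted and the batch tensors below are placeholders.

```python
# Cross-sensor alignment sketch: a symmetric InfoNCE-style loss pulls
# paired optical/SAR embeddings together in a shared space. The two
# encoders are omitted; the batch tensors are placeholders.
import torch
import torch.nn.functional as F

def contrastive_align(opt_emb, sar_emb, temperature=0.07):
    opt = F.normalize(opt_emb, dim=-1)
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(opt.size(0))     # i-th optical pairs with i-th SAR
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

opt_emb = torch.randn(32, 256)  # optical-branch features for one batch
sar_emb = torch.randn(32, 256)  # SAR-branch features, pixel-aligned pairs
print(contrastive_align(opt_emb, sar_emb).item())
```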
Implementation Strategies for Remote Sensing AGI
1. Interactive Systems:
§ Build tools like SAM + ChatGPT for Geospatial Analysis,
where users describe targets in natural language (e.g., “Find
all deforested patches in this rainforest”) and refine results
through dialogue.
2. Generative Pre-training:
§ Train models on massive, diverse datasets (e.g., ESA’s
Phi-Lab collections) using masked autoencoders (MAE) to
predict missing regions in satellite imagery (see the masking
sketch after this list).
3. Multimodal Fusion:
§ Combine satellite data with LiDAR, social media feeds, or
IoT sensor data (e.g., soil moisture sensors) to enrich
context. For example:
§ Fuse SAR data (penetrates clouds) with optical
imagery to monitor deforestation in tropical regions.
4. Ethical & Scalable Deployment:
§ Address biases in training data (e.g., overrepresentation of
urban areas) and ensure models work equitably across global
regions.
§ Optimize models for edge devices (e.g., drones) using
techniques like quantization or knowledge distillation.
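As referenced in item 2 above, here is a minimal sketch of the MAE-style masking step: most patch tokens are randomly dropped and a decoder (omitted here) would be trained to reconstruct them. The shapes and the 75% mask ratio are illustrative defaults, not values from the reviewed paper.

```python
# MAE-style pre-training sketch: randomly mask most patch tokens and train
# a decoder (omitted) to reconstruct them. Shapes and the 75% ratio are
# illustrative defaults.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return kept tokens and indices."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                 # one random score per token
    ids = noise.argsort(dim=1)[:, :n_keep]   # indices of kept tokens
    kept = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids

patches = torch.randn(8, 196, 768)  # 8 images, 14x14 patch embeddings
kept, ids = random_masking(patches)
print(kept.shape)  # torch.Size([8, 49, 768]); the decoder reconstructs the rest
```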
Example Use Case: Ship Detection in SAR Imagery
§ Challenge: Sparse ships in vast ocean scenes, sensor noise, and
varying ship sizes.
§ Solution:
§ Interactive Prompting: Let users click on suspected ship
locations to guide the model.
§ Multimodal Fusion: Integrate AIS (Automatic
Identification System) data to validate detections.
§ Domain Adaptation: Pre-train on global SAR datasets to
handle regional variations (e.g., icebergs in polar regions vs.
ships in tropical waters).
Conclusion
The path to Visual AGI in remote sensing requires bridging the gap between
static models and dynamic, user-centric systems. By adopting NLP-inspired
strategies—such as interactive prompting, generative pre-training, and
multimodal fusion—researchers can overcome semantic sparsity, domain
shifts, and interaction limitations. These advancements will unlock
applications like real-time disaster monitoring, precision agriculture, and
global environmental tracking, making geospatial AI more accessible and
impactful.
5.3 Cross-Domain Applications:
o Healthcare: SAM for automated tumor segmentation [169].
o Agriculture: Crop monitoring via SAM-based drones [170].
§ Current Use:
§ Crop Monitoring: SAM generates segmentation masks for crops,
weeds, and pests using point/box prompts ([170]).
§ Livestock Tracking: SAM-based object tracking monitors animal
behavior (e.g., broiler bird movement patterns [171]).
§ Challenges:
§ Complex outdoor environments (e.g., occlusions, variable lighting).
§ Limited annotated agricultural datasets.
§ Implementation:
§ Domain-Specific Tuning: Fine-tuning SAM on agricultural
datasets (e.g., drone-captured crop images).
§ Multi-Modal Fusion: Integrating weather data or soil sensors with
SAM’s visual prompts for predictive analytics.
o Robotics: SAM for real-time object manipulation [112].
6. Conclusion:
o Summary: Visual prompt engineering is pivotal for AGI, enabling flexible,
efficient, and human-aligned model behavior.
o Future Work: Address generalization gaps, integrate interdisciplinary methods
(NLP + CV), and expand real-world deployments.
7. Methodology Breakdown
1. Computational Modeling (Algorithm Development)
o Definition: Designing and optimizing algorithms or architectures (e.g., vision
models, prompt-tuning methods) to address specific tasks.
o Examples:
§ Visual Prompt Tuning (VPT) [95]: Introduced task-specific learnable
prompts in input space for efficient fine-tuning of pre-trained vision models.
§ SAM (Segment Anything Model) [53]: A universal segmentation model
trained on diverse datasets using prompt engineering.
§ Multi-modal Prompt Learning (MaPLe) [119]: Combined text and
image prompts for cross-modal understanding.
o Limitations: Heavy reliance on large-scale datasets (e.g., CLIP’s 400M image-text
pairs), computational costs, and overfitting risks (e.g., CoOP’s reduced
generalization on new data).
2. Experimental Evaluation
o Definition: Testing models on benchmark datasets or real-world applications to
measure performance.
o Examples:
§ SAM in medical imaging [98, 100]: Evaluated SAM’s zero-shot
segmentation accuracy against manual clinical delineation.
§ SAM-Adapter [149]: Tested domain-specific knowledge infusion for
improved segmentation in pseudocolor object detection.
o Limitations: Context-specific performance (e.g., SAM struggles in low-contrast
environments [Page 13]) and dataset biases (e.g., remote sensing image orientation
challenges [Page 14]).
3. Systematic Literature Review
o Definition: Synthesizing existing research to identify trends, gaps, and
methodologies.
o Examples:
§ The review itself analyzed 500 papers on "visual prompt" (filtered via
ChatGPT) to map advancements in prompt engineering [Page 3].
o Limitations: Potential selection bias (e.g., reliance on arXiv papers) and exclusion
of non-English or unpublished work.
4. Hybrid Model Integration
o Definition: Combining foundational models (e.g., SAM + CLIP + Stable
Diffusion) for enhanced task performance.
o Examples:
§ Inpaint Anything [115]: Integrated SAM, LaMa, and Stable Diffusion for
image inpainting.
§ Edit Everything [152]: Merged SAM, CLIP, and Stable Diffusion for
text-guided image editing.
o Limitations: Increased complexity and computational overhead.
8. Methodological Trends
Dominant: Computational modeling (e.g., ViT-based architectures, prompt-tuning
frameworks) and experimental benchmarking.
Emerging:
o Hybrid systems (e.g., combining SAM with language models like ChatGPT
[150]).
o Dynamic prompting (e.g., DAM-VP’s diversity-aware meta prompts [140]).
o Reinforcement learning integration (e.g., optimizing prompts via feedback
signals [Page 17]).
Research Gaps & Limitations
Explicitly Stated Gaps
1. Generalization in Complex Environments:
§ Quote: “SAM’s ability to generalize in complex application scenarios ...
may not meet task requirements” [Page 13].
2. Domain Adaptation:
§ Quote: “Handling complex distribution shifts from the original pre-training
data” requires diversity-aware prompts [Page 12].
3. Interactive Environments for CV:
§ Quote: “The CV domain lacks a clear path and lacks interactive
environments” compared to NLP [Page 18].
Implicit Gaps
1. Real-World Deployment: Limited discussion on real-time performance or
hardware constraints (e.g., SAM’s computational cost in robotics [112]).
2. Ethical Considerations: No mention of biases in prompt design (e.g., CLIP’s
sensitivity to fixed manual prompts [Page 9]).
3. Cross-Domain Evaluation: Few studies test models across drastically different
domains (e.g., natural vs. medical images).
Limitations of Existing Research
o Sample Size/Scope: Many studies use narrow datasets (e.g., medical imaging [54],
agriculture [170]).
o Generalizability: Overfitting in automatic prompts (e.g., CoOP [117]) and
domain-specific tuning (e.g., SAM-Adapter [149]).
o Bias: Manual prompt design relies on “professional experience” [Page 6],
introducing subjectivity.
9. Future Research Opportunities
Specific Research Questions
Enhanced Generalization:
§ Example: “Investigate SAM’s adaptability to low-contrast medical images
using reinforcement learning with hybrid text-visual prompts.”
Efficiency Optimization:
§ Example: “Develop lightweight prompt modules for real-time deployment
in robotics using knowledge distillation [163].”
Ethical Frameworks:
§ Example: “Evaluate bias mitigation in CLIP-like models through fairness-
aware prompt engineering.”
10. Interdisciplinary Opportunities
o Healthcare + CV: Integrate SAM with radiology workflows for automated tumor
segmentation [169].
o NLP + CV: Design unified frameworks for joint text-image prompting (e.g.,
ChatGPT + SAM [150]).
o Agriculture + Robotics: Deploy SAM-based crop monitoring systems on drones
[170].