Review: Visual Prompt Engineering in Foundational
Models for AGI Applications
1. Introduction:
o Focus: Traces the evolution of foundational AI models (e.g., Transformer, GPT, ViT)
and their role in advancing visual prompt engineering.
o Key Argument: Prompt engineering bridges pre-trained models and downstream tasks,
enabling zero-shot generalization. Highlights the shift from NLP to CV, emphasizing
the need for adaptive interfaces like SAM and CLIP.
2. Background Knowledge:
o Defines prompts in NLP (cloze/prefix prompts) and CV (SAM, CLIP). Highlights
Transformer’s dominance in multi-modal tasks.
o 2.1 Prompts in NLP:
§ Defines cloze prompts (mid-text) and prefix prompts (end-of-text).
§ Discusses manual vs. automated prompt design (e.g., discrete/continuous
prompts like Prefix-Tuning[65]).
§ Limitations: Manual prompts are task-specific and labor-intensive;
automated methods risk overfitting.
o 2.2 Foundation Models:
§ Transformer: Basis for ViT, enabling multi-modal tokenization (text,
images).
§ CLIP: Aligns text-image embeddings via contrastive learning on 400M
pairs.
§ VPT: Learns task-specific prompts in input space without modifying model
parameters.
§ SAM: Universal segmentation model using prompts (points, boxes) for
zero-shot generalization.
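To make CLIP's prompt interface concrete, below is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name, image path, and label set are illustrative placeholders, not choices from the reviewed paper.

```python
# Minimal zero-shot classification sketch with CLIP (Hugging Face transformers).
# The checkpoint name, image path, and candidate labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image
labels = ["cat", "dog", "airplane"]
# The fixed template "a photo of a {}" is exactly the kind of manual prompt
# the review discusses; performance is sensitive to this wording.
texts = [f"a photo of a {c}" for c in labels]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(labels, probs[0].tolist())))
```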
3. Visual Prompts Learning:
o Discusses methodologies: multi-modal prompts (CLIP, CoOP), visual tuning
(VPT), and hybrid systems (DenseCLIP, MaPLe).
3.1 Multi-Modal Models and Prompts:
o CLIP[49]: Uses fixed text prompts (e.g., "a photo of [class]") but is highly
sensitive to prompt wording.
o CoOP[117]: Trains continuous prompts for flexibility but overfits to downstream
tasks.
o DenseCLIP[118]: Transfers CLIP to dense tasks via pixel-text matching.
o MaPLe[119]: Jointly optimizes text-image prompts for cross-modal alignment.
Key Trend: Shift from manual to dynamic, context-aware prompting.
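As a sketch of the continuous-prompt idea behind CoOP, the snippet below learns a bank of context vectors that are prepended to frozen class-name embeddings; all shapes and initializations are illustrative stand-ins rather than CoOP's exact recipe.

```python
# CoOP-style continuous prompts: learn context vectors in embedding space
# while the backbone stays frozen. Shapes and names are illustrative.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_classes=10):
        super().__init__()
        # Learned "words": n_ctx context vectors shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name embeddings would come from CLIP's token
        # embedder; random tensors stand in for them here.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        # Prepend the shared context to every class embedding:
        # output shape (n_classes, n_ctx + 1, dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)

prompts = LearnableContext()
# Only the context vectors receive gradients during downstream tuning.
optimizer = torch.optim.AdamW([prompts.ctx], lr=2e-3)
print(prompts().shape)  # torch.Size([10, 17, 512])
```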
3.2 Visual Prompts:
o VPT[95]: Adds learnable parameters to inputs for efficient tuning of frozen models.
o AdaptFormer[137]: Integrates lightweight modules into ViT for action
recognition.
o ConvPass[138]: Tailors pre-trained ViTs with convolutional bypasses to
reduce computational expense.
o ViPT: Addresses the challenge of limited large-scale training data in
downstream multi-modal tracking tasks.
o DAM-VP[140]: Uses meta-prompts to handle distribution shifts across domains.
Limitation: Balancing computational efficiency with adaptability (e.g.,
ConvPass[138]).
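The following minimal sketch illustrates the VPT pattern named above: learnable prompt tokens are prepended to the patch-token sequence of a frozen transformer, and only the prompts and a task head are trained. The encoder here is a generic stand-in for a pre-trained ViT, not VPT's exact architecture.

```python
# VPT-style visual prompt tuning: prepend learnable prompt tokens to the
# patch-token sequence of a frozen transformer. The encoder is a generic
# stand-in for a pre-trained ViT block stack.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, encoder, embed_dim=768, n_prompts=10, n_classes=100):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, embed_dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(embed_dim, n_classes)  # task head is trained

    def forward(self, patch_tokens):  # (B, N, D) patch embeddings
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), 2)
model = PromptedViT(encoder)
logits = model(torch.randn(4, 196, 768))  # 4 images, 14x14 patches
print(logits.shape)  # torch.Size([4, 100])
```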
4. Visual Prompts in AGI:
o Applications in object detection (SAM’s crater segmentation), multi-modal fusion
(Text2Seg), and model combinations (Inpaint Anything).
4.1 Object Detection:
o Object Counting: Researchers adopt SAM by employing bounding boxes as
prompts to generate segmentation masks (see the sketch at the end of this
subsection).
o Zero-Shot Segmentation: SAM applied in medical imaging (e.g., tumor
delineation [54]) and remote sensing (rotated bounding boxes [147]).
Limitations: Struggles with low-contrast or ambiguous semantics.
o SAM-Adapter[149]: Infuses domain knowledge (e.g., medical scans) into SAM.
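A minimal sketch of the box-prompted workflow described above, using Meta's segment-anything package; the checkpoint path, image file, and box coordinates are placeholders.

```python
# Box-prompted zero-shot segmentation with SAM (segment-anything package).
# Checkpoint path, image file, and box coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # compute the image embedding once

# One axis-aligned box prompt per object instance (x1, y1, x2, y2).
boxes = [np.array([120, 80, 310, 260]), np.array([400, 50, 520, 180])]
masks = []
for box in boxes:
    m, scores, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])

print(f"segmented {len(masks)} instances")  # e.g., for object counting
```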
4.2 Multi-modal Fusion:
o Text2Seg[147]: Combines text prompts (CLIP) with SAM for segmentation.
o SAMText: Generates segmentation masks for scene text in images or video
frames.
o Caption Anything[150]: Framework integrating SAM and ChatGPT for interactive
image captioning.
o SAA+[151]: Uses hybrid prompts for anomaly detection in industrial settings.
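Text2Seg's full pipeline is more elaborate, but the text-to-mask idea can be sketched by scoring candidate boxes with CLIP and handing the best match to SAM as a box prompt; the proposal boxes, file paths, and checkpoints below are assumptions for illustration, not the published method.

```python
# Text-guided segmentation sketch: rank candidate boxes with CLIP, then
# pass the best-matching box to SAM. Illustrative pipeline in the spirit
# of Text2Seg, not its exact implementation.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from segment_anything import sam_model_registry, SamPredictor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def best_box_for(text, image, boxes):
    """Score each cropped box against the text prompt; return the winner."""
    crops = [image.crop(tuple(b)) for b in boxes]
    inputs = proc(text=[text], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_text[0]  # one score per crop
    return boxes[int(sims.argmax())]

image = Image.open("scene.jpg").convert("RGB")
boxes = [np.array([0, 0, 200, 200]), np.array([200, 0, 400, 200])]  # proposals
box = best_box_for("a red car", image, boxes)
predictor.set_image(np.array(image))
mask, _, _ = predictor.predict(box=box, multimask_output=False)
```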
4.3 Combination of Models:
o Inpaint Anything[115]: Merges SAM, LaMa, and Stable Diffusion for image
editing.
o Edit Everything[152]: Edits images via SAM + CLIP + Stable Diffusion.
o SAM-Track[104]: Enables video object tracking with multi-modal prompts.
o Explain Any Concept: Employs SAM for initial segmentation and introduces
a surrogate model for efficient explanation.
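A minimal sketch of the remove-and-fill composition behind systems like Inpaint Anything, assuming a SAM mask has already been saved to disk and using the diffusers inpainting pipeline; the checkpoint name, file paths, and text prompt are placeholders.

```python
# "Inpaint Anything"-style composition sketch: a SAM-produced mask selects
# the region, and a diffusion inpainting model fills it from a text prompt.
# Checkpoint names and file paths are illustrative placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")  # assumes a GPU is available

image = Image.open("scene.jpg").convert("RGB").resize((512, 512))
# White pixels mark the object selected by a click-prompted SAM mask.
mask = Image.open("sam_mask.png").convert("L").resize((512, 512))

result = pipe(prompt="empty park bench", image=image, mask_image=mask).images[0]
result.save("edited.jpg")
```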
5. Future Directions:
o Advocates for adaptive tuning (reinforcement learning, adapters) and addresses
open challenges (interactive environments, semantic sparsity).
5.1 Adaptation of LVMs:
o Key Techniques: Prompt fine-tuning, reinforcement learning, adapter modules,
and knowledge distillation.
o Goal: Improve efficiency (e.g., lightweight modules) and generalization (e.g.,
DAM-VP).
§ 1. Key Techniques
(a) Prompt Fine-Tuning
§ Definition: Adjusting input prompts (e.g., text, bounding boxes,
points) to guide LVMs toward domain-specific tasks without
retraining the entire model.
§ Remote Sensing Application:
§ Task-Specific Prompts: Use prompts like "segment
rectangular agricultural fields" or "detect irregularly
shaped water bodies" to improve SAM’s zero-shot
segmentation.
§ Multi-Modal Prompts: Combine text prompts (e.g., "urban
infrastructure") with visual prompts (e.g., rotated bounding
boxes for slanted rooftops).
§ Example: In SAM, rotated bounding boxes (R-Boxes) are used to
align with objects in top-down satellite imagery (Page 14).
(b) Reinforcement Learning (RL)
§ Definition: Training models to iteratively refine prompts based on
feedback (e.g., reward signals for accurate segmentation).
§ Remote Sensing Application:
§ Dynamic Prompt Adjustment: Train SAM to optimize
prompts for detecting occluded objects (e.g., vehicles under
tree cover) by rewarding high IoU (Intersection over Union)
scores.
§ Adaptive Exploration: Use RL to balance exploration
(trying new prompts) and exploitation (using effective
prompts) in diverse environments (e.g., deserts vs. forests).
§ Challenge: Designing reward functions that account for remote
sensing complexities (e.g., shadows, seasonal changes).
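The review does not fix a specific RL algorithm; as one hedged illustration, a simple bandit-style search that rewards candidate point prompts by the IoU of the resulting mask could look like this, where predict_mask and gt_mask are stand-ins for a SAM predictor and a reference annotation.

```python
# Bandit-style prompt search sketch: sample candidate point prompts, reward
# each by the IoU of the mask it produces, and keep the best performer.
# `predict_mask` and `gt_mask` stand in for a SAM predictor and a reference
# annotation; neither comes from the reviewed paper.
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def search_prompt(predict_mask, gt_mask, h, w, n_trials=50, seed=0):
    rng = np.random.default_rng(seed)
    best_point, best_reward = None, -1.0
    for _ in range(n_trials):
        point = rng.integers(0, (w, h))  # candidate (x, y) prompt
        reward = iou(predict_mask(point), gt_mask)  # feedback signal
        if reward > best_reward:
            best_point, best_reward = point, reward
    return best_point, best_reward
```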
(c) Adapter Modules
§ Definition: Adding lightweight, task-specific layers to pre-trained
LVMs to adapt them to new domains.
§ Remote Sensing Application:
§ SAM-Adapter [149]: Inject domain knowledge (e.g.,
geospatial metadata) into SAM via small neural modules.
For example, adapters trained on desert imagery improve
segmentation of sand dunes.
§ Efficiency: Adapters add <1% parameters to SAM, enabling
deployment on satellites/drones with limited compute.
§ Example: A desert-specific adapter could reduce false positives in
arid regions by filtering out vegetation-like textures.
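SAM-Adapter's internals differ in detail, but the generic bottleneck-adapter pattern these bullets describe can be sketched as a small residual module attached to a frozen layer; the dimensions are illustrative.

```python
# Generic bottleneck adapter sketch: a small residual MLP attached to a
# frozen backbone layer so that <1% of parameters are trained per domain.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to a tiny space
        self.up = nn.Linear(bottleneck, dim)    # project back
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

# Wrap a frozen layer: only the adapter's ~50k parameters are trainable.
block = nn.Linear(768, 768)
for p in block.parameters():
    p.requires_grad = False
adapted = nn.Sequential(block, Adapter())
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))
```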
(d) Knowledge Distillation
§ Definition: Transferring knowledge from large models (e.g., SAM)
to smaller, efficient models.
§ Remote Sensing Application:
§ Lightweight Models: Distill SAM’s segmentation
capabilities into a smaller ViT model for real-time crop
monitoring on drones.
§ Edge Deployment: Compress SAM for use in low-power
devices (e.g., CubeSats) by mimicking its mask prediction
behavior.
§ Example: A distilled SAM variant achieves 90% of the original
model’s accuracy on road detection tasks but runs 5x faster.
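As a sketch of the distillation objective, a student network can be trained to mimic a frozen teacher's mask logits on unlabeled imagery; the teacher, student, and data batch below are stand-ins, and the MSE objective is one common choice rather than a prescription from the paper.

```python
# Knowledge-distillation sketch: train a small student to mimic a frozen
# teacher's mask logits. `teacher`, `student`, `images`, and `optimizer`
# are stand-ins for real models and a real data loader.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)        # teacher mask logits, e.g. (B, 1, H, W)
    pred = student(images)              # student predictions, same shape
    loss = F.mse_loss(pred, target)     # match the teacher's behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```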
2. Goals
(a) Improve Efficiency
§ Lightweight Modules:
§ Use ConvPass [138] to add convolutional bypasses to ViT,
reducing SAM’s compute load for processing high-resolution
satellite imagery.
§ Deploy quantized SAM on edge devices (e.g., drones) for
real-time deforestation monitoring.
§ Selective Prompting: Only activate resource-intensive model
components when needed (e.g., SAM’s mask decoder for complex
objects).
(b) Enhance Generalization
§ DAM-VP (Diversity-Aware Meta Visual Prompting) [140]:
§ Clustering: Group remote sensing data into subsets (e.g.,
urban, agricultural, coastal) and train specialized prompts for
each.
§ Meta-Learning: Use DAM-VP to quickly adapt prompts to
new regions (e.g., transferring forest segmentation prompts
to mangrove detection).
§ Cross-Domain Tuning: Pre-train SAM on multi-modal remote
sensing datasets (e.g., RGB + infrared) to handle sensor variations.
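DAM-VP's full meta-learning loop is beyond a snippet, but its diversity-aware core, clustering a dataset and keeping one prompt per cluster, can be sketched as follows; the random features stand in for real backbone embeddings.

```python
# Diversity-aware prompting sketch: cluster image features and keep one
# learnable prompt per cluster, in the spirit of DAM-VP. The features
# here are random placeholders for real backbone embeddings.
import numpy as np
import torch
from sklearn.cluster import KMeans

features = np.random.randn(1000, 512)   # embeddings of a training dataset
kmeans = KMeans(n_clusters=8, n_init=10).fit(features)

# One prompt bank entry per cluster (e.g., urban/agricultural/coastal).
prompts = torch.nn.Parameter(torch.zeros(8, 10, 768))

def prompt_for(feature):
    """Route a new image to its cluster's prompt."""
    cluster = int(kmeans.predict(feature.reshape(1, -1))[0])
    return prompts[cluster]

print(prompt_for(features[0]).shape)  # torch.Size([10, 768])
```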
3. Case Study: SAM in Remote Sensing
§ Problem: SAM struggles with arbitrarily oriented objects (e.g.,
ships, slanted buildings) in satellite imagery.
§ Solution:
1. Rotated Bounding Box Prompts: Use R-Boxes to guide
SAM’s segmentation (Page 14).
2. Adapter Fine-Tuning: Train an adapter on maritime
datasets to improve ship detection.
3. Knowledge Distillation: Create a lightweight SAM variant
for deployment on low-orbit satellites.
§ Result: SAM achieves 85% accuracy in segmenting ships with
rotated prompts, compared to 62% with default horizontal boxes.
4. Challenges & Future Work
§ Data Scarcity: Limited labeled remote sensing datasets for niche
tasks (e.g., glacier monitoring).
§ Solution: Use synthetic data generated by diffusion models
(e.g., Stable Diffusion) to augment training.
§ Real-Time Processing: High-resolution satellite imagery demands
efficient computation.
§ Solution: Hybrid architectures (e.g., SAM + lightweight
CNNs) for parallel processing.
§ Ethical Risks: Biases in prompts (e.g., misclassifying informal
settlements as "non-residential").
§ Solution: Fairness audits and inclusive prompt design (e.g.,
community-driven labeling).
5.2 Challenges for Visual AGI:
o Gaps: Lack of interactive environments (vs. NLP), semantic sparsity in images,
domain shifts.
o Proposal: Unified frameworks inspired by NLP (e.g., generative pre-training with
multimodal fusion).
§ 1. Lack of Interactive Environments
§ Challenge:
Unlike NLP systems (e.g., ChatGPT), which thrive on iterative user
interactions (e.g., refining prompts via dialogue), remote sensing
models lack frameworks for dynamic user engagement. For
example:
§ A user analyzing satellite imagery cannot iteratively guide a
model to refine segmentation masks of "urban sprawl" or
"deforested areas" in real time.
§ Current models like SAM operate statically, producing
outputs without adapting to user feedback.
§ Proposal:
§ Interactive Prompting: Develop systems where users can
click, draw, or describe regions of interest (e.g., "Segment
all oil tanks in this SAR image") to iteratively refine results.
§ Feedback Loops: Integrate reinforcement learning (RL) to
let models learn from user corrections (e.g., rewarding
accurate detection of ships in cloudy imagery).
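A minimal sketch of the interactive-prompting proposal using the segment-anything predictor API: user clicks accumulate as point prompts, and only SAM's lightweight mask decoder re-runs per correction. The click handler and blank image are stand-ins for a real UI and real imagery.

```python
# Interactive refinement sketch with SAM: each user click (positive or
# negative) is appended to the point-prompt list and only the mask decoder
# re-runs; the expensive image embedding is computed once.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder imagery
predictor.set_image(image)  # expensive encoder pass, done once

def get_click():
    # Stand-in for a real UI handler: returns ((x, y), is_foreground_click).
    return (np.random.randint(0, 512, size=2), True)

points, labels = [], []  # labels: 1 = foreground, 0 = background
for _ in range(5):       # five rounds of user correction
    xy, is_fg = get_click()
    points.append(xy)
    labels.append(1 if is_fg else 0)
    mask, _, _ = predictor.predict(
        point_coords=np.array(points),
        point_labels=np.array(labels),
        multimask_output=False,  # single refined mask per round
    )
    print("mask pixels:", int(mask[0].sum()))  # stand-in for display
```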
2. Semantic Sparsity in Images
§ Challenge:
Remote sensing images often exhibit sparse, scattered objects
against vast backgrounds (e.g., ships in oceans, vehicles in deserts).
Key issues include:
§ Low Object Density: Most pixels lack meaningful
information (e.g., empty agricultural fields).
§ Scale Variability: Objects range from small (e.g., cars) to
large (e.g., wind farms).
§ Ambiguity: Similar spectral signatures for different objects
(e.g., water bodies vs. shadows).
§ Proposal:
§ Generative Pre-training: Train models on diverse datasets
(e.g., combining Sentinel-2, Landsat, and UAV imagery) to
learn robust feature representations of sparse objects.
§ Attention Mechanisms: Use transformer-based
architectures (e.g., Vision Transformers) to focus
computational resources on regions of interest.
§ Multimodal Fusion: Integrate ancillary data (e.g., elevation
models, weather data) to resolve ambiguities (e.g.,
distinguishing lakes from shadows using terrain height).
3. Domain Shifts
§ Challenge:
Remote sensing models struggle with variations across:
§ Sensors: Optical and SAR imagery have drastically different
characteristics (e.g., SAR highlights texture; optical captures
color).
§ Geographies: A model trained on European urban areas
may fail on African rural landscapes.
§ Temporal Changes: Seasonal variations (e.g., snow cover
vs. summer vegetation) alter object appearances.
§ Proposal:
§ Unified Frameworks:
§ Cross-Sensor Alignment: Use contrastive learning
to map features from different sensors (e.g., optical
+ SAR) into a shared embedding space.
§ Domain-Invariant Prompts: Design prompts that
generalize across regions (e.g., "Segment buildings"
instead of "Segment European-style buildings").
§ Continual Learning: Enable models to adapt to new
geographies or seasons without forgetting previous
knowledge.
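The cross-sensor alignment proposal can be sketched as a standard symmetric InfoNCE-style contrastive loss over paired optical/SAR embeddings; the encoders are omitted and the batch tensors below are placeholders.

```python
# Cross-sensor alignment sketch: a symmetric InfoNCE-style loss pulls
# paired optical/SAR embeddings together in a shared space. The two
# encoders are omitted; the batch tensors are placeholders.
import torch
import torch.nn.functional as F

def contrastive_align(opt_emb, sar_emb, temperature=0.07):
    opt = F.normalize(opt_emb, dim=-1)
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(opt.size(0))     # i-th optical pairs with i-th SAR
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

opt_emb = torch.randn(32, 256)  # optical-branch features for one batch
sar_emb = torch.randn(32, 256)  # SAR-branch features, pixel-aligned pairs
print(contrastive_align(opt_emb, sar_emb).item())
```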
Implementation Strategies for Remote Sensing AGI
1. Interactive Systems:
§ Build tools like SAM + ChatGPT for Geospatial Analysis,
where users describe targets in natural language (e.g., “Find
all deforested patches in this rainforest”) and refine results
through dialogue.
2. Generative Pre-training:
§ Train models on massive, diverse datasets (e.g., ESA’s
Phi-Lab collections) using masked autoencoders (MAE) to
predict missing regions in satellite imagery (see the masking
sketch after this list).
3. Multimodal Fusion:
§ Combine satellite data with LiDAR, social media feeds, or
IoT sensor data (e.g., soil moisture sensors) to enrich
context. For example:
§ Fuse SAR data (penetrates clouds) with optical
imagery to monitor deforestation in tropical regions.
4. Ethical & Scalable Deployment:
§ Address biases in training data (e.g., overrepresentation of
urban areas) and ensure models work equitably across global
regions.
§ Optimize models for edge devices (e.g., drones) using
techniques like quantization or knowledge distillation.
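As referenced in item 2 above, here is a minimal sketch of the MAE-style masking step: most patch tokens are randomly dropped and a decoder (omitted here) would be trained to reconstruct them. The shapes and the 75% mask ratio are illustrative defaults, not values from the reviewed paper.

```python
# MAE-style pre-training sketch: randomly mask most patch tokens and train
# a decoder (omitted) to reconstruct them. Shapes and the 75% ratio are
# illustrative defaults.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return kept tokens and indices."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                 # one random score per token
    ids = noise.argsort(dim=1)[:, :n_keep]   # indices of kept tokens
    kept = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids

patches = torch.randn(8, 196, 768)  # 8 images, 14x14 patch embeddings
kept, ids = random_masking(patches)
print(kept.shape)  # torch.Size([8, 49, 768]); the decoder reconstructs the rest
```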
Example Use Case: Ship Detection in SAR Imagery
§ Challenge: Sparse ships in vast ocean scenes, sensor noise, and
varying ship sizes.
§ Solution:
§ Interactive Prompting: Let users click on suspected ship
locations to guide the model.
§ Multimodal Fusion: Integrate AIS (Automatic
Identification System) data to validate detections.
§ Domain Adaptation: Pre-train on global SAR datasets to
handle regional variations (e.g., icebergs in polar regions vs.
ships in tropical waters).
Conclusion
The path to Visual AGI in remote sensing requires bridging the gap between
static models and dynamic, user-centric systems. By adopting NLP-inspired
strategies—such as interactive prompting, generative pre-training, and
multimodal fusion—researchers can overcome semantic sparsity, domain
shifts, and interaction limitations. These advancements will unlock
applications like real-time disaster monitoring, precision agriculture, and
global environmental tracking, making geospatial AI more accessible and
impactful.
5.3 Cross-Domain Applications:
o Healthcare: SAM for automated tumor segmentation [169].
o Agriculture: Crop monitoring via SAM-based drones [170].
§ Current Use:
§ Crop Monitoring: SAM generates segmentation masks for crops,
weeds, and pests using point/box prompts ([170]).
§ Livestock Tracking: SAM-based object tracking monitors animal
behavior (e.g., broiler bird movement patterns [171]).
§ Challenges:
§ Complex outdoor environments (e.g., occlusions, variable lighting).
§ Limited annotated agricultural datasets.
§ Implementation:
§ Domain-Specific Tuning: Fine-tuning SAM on agricultural
datasets (e.g., drone-captured crop images).
§ Multi-Modal Fusion: Integrating weather data or soil sensors with
SAM’s visual prompts for predictive analytics.
o Robotics: SAM for real-time object manipulation [112].
6. Conclusion:
o Summary: Visual prompt engineering is pivotal for AGI, enabling flexible,
efficient, and human-aligned model behavior.
o Future Work: Address generalization gaps, integrate interdisciplinary methods
(NLP + CV), and expand real-world deployments.
7. Methodology Breakdown
1. Computational Modeling (Algorithm Development)
o Definition: Designing and optimizing algorithms or architectures (e.g., vision
models, prompt-tuning methods) to address specific tasks.
o Examples:
§ Visual Prompt Tuning (VPT) [95]: Introduced task-specific learnable
prompts in input space for efficient fine-tuning of pre-trained vision models.
§ SAM (Segment Anything Model) [53]: A universal segmentation model
trained on diverse datasets using prompt engineering.
§ Multi-modal Prompt Learning (MaPLe) [119]: Combined text and
image prompts for cross-modal understanding.
o Limitations: Heavy reliance on large-scale datasets (e.g., CLIP’s 400M image-text
pairs), computational costs, and overfitting risks (e.g., CoOP’s reduced
generalization on new data).
2. Experimental Evaluation
o Definition: Testing models on benchmark datasets or real-world applications to
measure performance.
o Examples:
§ SAM in medical imaging [98, 100]: Evaluated SAM’s zero-shot
segmentation accuracy against manual clinical delineation.
§ SAM-Adapter [149]: Tested domain-specific knowledge infusion for
improved segmentation in pseudocolor object detection.
o Limitations: Context-specific performance (e.g., SAM struggles in low-contrast
environments [Page 13]) and dataset biases (e.g., remote sensing image orientation
challenges [Page 14]).
3. Systematic Literature Review
o Definition: Synthesizing existing research to identify trends, gaps, and
methodologies.
o Examples:
§ The review itself analyzed 500 papers on "visual prompt" (filtered via
ChatGPT) to map advancements in prompt engineering [Page 3].
o Limitations: Potential selection bias (e.g., reliance on arXiv papers) and exclusion
of non-English or unpublished work.
4. Hybrid Model Integration
o Definition: Combining foundational models (e.g., SAM + CLIP + Stable
Diffusion) for enhanced task performance.
o Examples:
§ Inpaint Anything [115]: Integrated SAM, LaMa, and Stable Diffusion for
image inpainting.
§ Edit Everything [152]: Merged SAM, CLIP, and Stable Diffusion for
text-guided image editing.
o Limitations: Increased complexity and computational overhead.
8. Methodological Trends
Dominant: Computational modeling (e.g., ViT-based architectures, prompt-tuning
frameworks) and experimental benchmarking.
Emerging:
o Hybrid systems (e.g., combining SAM with language models like ChatGPT
[150]).
o Dynamic prompting (e.g., DAM-VP’s diversity-aware meta prompts [140]).
o Reinforcement learning integration (e.g., optimizing prompts via feedback
signals [Page 17]).
Research Gaps & Limitations
Explicitly Stated Gaps
1. Generalization in Complex Environments:
§ Quote: “SAM’s ability to generalize in complex application scenarios ...
may not meet task requirements” [Page 13].
2. Domain Adaptation:
§ Quote: “Handling complex distribution shifts from the original pre-training
data” requires diversity-aware prompts [Page 12].
3. Interactive Environments for CV:
§ Quote: “The CV domain lacks a clear path and lacks interactive
environments” compared to NLP [Page 18].
Implicit Gaps
1. Real-World Deployment: Limited discussion on real-time performance or
hardware constraints (e.g., SAM’s computational cost in robotics [112]).
2. Ethical Considerations: No mention of biases in prompt design (e.g., CLIP’s
sensitivity to fixed manual prompts [Page 9]).
3. Cross-Domain Evaluation: Few studies test models across drastically different
domains (e.g., natural vs. medical images).
Limitations of Existing Research
o Sample Size/Scope: Many studies use narrow datasets (e.g., medical imaging [54],
agriculture [170]).
o Generalizability: Overfitting in automatic prompts (e.g., CoOP [117]) and
domain-specific tuning (e.g., SAM-Adapter [149]).
o Bias: Manual prompt design relies on “professional experience” [Page 6],
introducing subjectivity.
9. Future Research Opportunities
Specific Research Questions
Enhanced Generalization:
§ Example: “Investigate SAM’s adaptability to low-contrast medical images
using reinforcement learning with hybrid text-visual prompts.”
Efficiency Optimization:
§ Example: “Develop lightweight prompt modules for real-time deployment
in robotics using knowledge distillation [163].”
Ethical Frameworks:
§ Example: “Evaluate bias mitigation in CLIP-like models through fairness-
aware prompt engineering.”
10. Interdisciplinary Opportunities
o Healthcare + CV: Integrate SAM with radiology workflows for automated tumor
segmentation [169].
o NLP + CV: Design unified frameworks for joint text-image prompting (e.g.,
ChatGPT + SAM [150]).
o Agriculture + Robotics: Deploy SAM-based crop monitoring systems on drones
[170].