| Challenge Area | Limitation | Example | Severity | Impact |
|---|---|---|---|---|
| Application Scope | PrefixLM-based models are confined to generative tasks such as captioning and VQA | SimVLM's architecture supports image captioning and VQA but not object detection or segmentation | Moderate | Limits versatility in real-world applications requiring diverse vision-language tasks |
| Data Quality | Noisy image-text pairs in web-scraped datasets degrade model performance | ALIGN and LAION-5B require custom preprocessing and filtering pipelines to handle noise | High | Reduces model generalization and necessitates costly dataset curation |
| Model Complexity | MLM/ITM approaches depend on pre-trained object detectors for region proposals | VisualBERT relies on Faster R-CNN for region features, adding computational overhead | High | Increases deployment complexity and limits scalability |
| Domain Adaptation | Domain-specific aligned image-text datasets are scarce | Clinical-BERT struggles with the limited supply of medical image-text pairs | High | Hinders adoption in healthcare and robotics |
| Cross-Modal Fusion | Cross-attention methods can be inefficient at large-scale fusion | FIBER's gating mechanism remains unproven on massive datasets | Moderate | Limits real-time applications on edge devices |
| Zero-Shot Transfer | Frozen-backbone models still require trainable adaptation layers | Flamingo's Perceiver Resampler layers require partial training | Moderate | Prevents true zero-shot generalization |
| Emerging Applications | Nascent 3D and robotics tasks lack large-scale validation | AvatarCLIP and OWL-ViT are constrained by limited datasets | Low | Delays deployment in AR/VR and autonomous systems |
| Training Dependency | "Training-free" methods still require a small amount of aligned data | ASIF relies on a small aligned dataset for similarity search | Moderate | Limits applicability in data-scarce domains |
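The Cross-Modal Fusion and Zero-Shot Transfer rows both point at gated cross-attention, where a frozen language model attends to vision tokens through a tanh-gated residual branch (as in Flamingo's gated layers). A minimal single-head numpy sketch follows; the shapes, the single-head formulation, and the weight initialization are illustrative assumptions, not the published models' configurations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, vision, Wq, Wk, Wv, gate):
    """Single-head cross-attention from text tokens to vision tokens,
    scaled by a tanh gate. With gate initialized at 0, the fused model
    starts out identical to the frozen language model."""
    q, k, v = text @ Wq, vision @ Wk, vision @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return text + np.tanh(gate) * (attn @ v)

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))    # 5 text-token embeddings (illustrative)
vision = rng.normal(size=(9, d))  # 9 vision-token embeddings (illustrative)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out_init = gated_cross_attention(text, vision, Wq, Wk, Wv, gate=0.0)
assert np.allclose(out_init, text)  # zero gate: vision is ignored entirely
```

The gate is the key design choice: because `tanh(0) = 0`, inserting these layers into a pre-trained language model does not perturb its behavior at initialization, and the vision pathway is blended in only as the gate is trained away from zero.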
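The Training Dependency row refers to ASIF-style matching, where frozen unimodal encoders are bridged without any cross-modal training: each image or caption is re-represented by its similarities to a small set of aligned anchor pairs, and matching happens in that shared relative space. The sketch below uses random vectors as stand-ins for frozen encoder outputs; the dimensions, anchor count, and noise level are illustrative assumptions, and ASIF's sparsification step is omitted.

```python
import numpy as np

def relative_rep(z, anchors):
    """Represent an embedding by its cosine similarity to each anchor."""
    z = z / np.linalg.norm(z)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return a @ z

rng = np.random.default_rng(0)
n_anchors = 20  # the small aligned image-text set the method depends on

# Stand-ins for frozen unimodal encoder outputs on the aligned anchors.
img_anchors = rng.normal(size=(n_anchors, 256))
txt_anchors = rng.normal(size=(n_anchors, 128))

# A query image close to anchor 7, and one candidate caption per anchor.
query_img = img_anchors[7] + 0.1 * rng.normal(size=256)
captions = txt_anchors + 0.1 * rng.normal(size=(n_anchors, 128))

# Map both modalities into the shared relative space: a vector's
# coordinates are its similarities to the aligned anchor set.
q = relative_rep(query_img, img_anchors)
scores = np.array([relative_rep(c, txt_anchors) @ q for c in captions])
best = int(np.argmax(scores))  # the caption aligned with anchor 7 wins
```

This is also why the table flags the approach as data-dependent: the anchor pairs are the only bridge between modalities, so retrieval quality degrades directly with the size and coverage of that small aligned set.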