Blog Reviews

A Dive into Vision-Language Models

Background:

Introducing joint vision-language models, focusing on how they're trained.

A “vision-language” model is characterized by its ability to process both images (vision) and natural language text (language). What it learns and predicts depends on the inputs, the outputs, and the task the model is asked to perform.


How does the model predict? It needs to understand both the input image and the text prompt.

These tasks take various forms, for example:

  1. Image retrieval from natural language text.
  2. Phrase grounding, i.e., performing object detection from an input image and a natural language phrase (example: “A young person swings a bat”).
  3. Visual question answering, i.e., finding answers from an input image and a question in natural language.
  4. Generating a caption for a given image. This can also take the form of conditional text generation, where you'd start with a natural language prompt and an image.
  5. Detection of hate speech from social media content involving both image and text modalities.

 

1. Methodology

Five core learning strategies for training vision-language models (VLMs):

  1. Contrastive Learning

Contrastive learning aims to align image and text representations in a shared embedding space, where semantically similar pairs (e.g., an image and its caption) sit close together while dissimilar pairs are far apart.


a)    Goal: Align image and text embeddings into a shared space (e.g., CLIP, ALIGN).

b)    Loss: A contrastive loss (e.g., InfoNCE over cosine similarities) that minimizes the distance between matched image-text pairs and maximizes it for mismatched pairs.

c)     Strengths: Enables zero-shot generalization (e.g., image classification).
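
To make the objective concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. The encoders are assumed to live elsewhere, and `contrastive_loss` is an illustrative name rather than an API from CLIP or ALIGN:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the image and text
    encoders; row i of each tensor is assumed to form a matched pair.
    """
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy pulls each matched pair together and pushes all
    # mismatched pairs in the batch apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The same shared space is what enables the zero-shot use noted above: embed each class name as a text prompt and assign the image to the most similar one.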

 

  2. PrefixLM

a)    Process: Treat image patches as a prefix to the text sequence, training autoregressive models to predict the next text token (e.g., SimVLM).

b)    Use Case: Image captioning, VQA.

c)     Variant (Frozen PrefixLM): freeze the pre-trained language model (LM) and train only the image encoder (or a small mapping network), so that visual embeddings serve as a prompt prefix for the frozen LM (e.g., Frozen, ClipCap).
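
A rough sketch of both ideas, with hypothetical `vision_encoder`, `token_embedding`, `decoder`, and `lm_head` modules standing in for the real components (none of this mirrors SimVLM's or ClipCap's actual code):

```python
import torch
import torch.nn as nn

class PrefixLM(nn.Module):
    """Toy PrefixLM: image patch embeddings form a prefix to the text tokens."""

    def __init__(self, vision_encoder, token_embedding, decoder, lm_head):
        super().__init__()
        self.vision_encoder = vision_encoder    # images -> (batch, n_patches, dim)
        self.token_embedding = token_embedding  # token ids -> (batch, seq, dim)
        self.decoder = decoder                  # autoregressive transformer
        self.lm_head = lm_head                  # hidden states -> vocab logits

    def forward(self, images, token_ids):
        prefix = self.vision_encoder(images)      # visual prefix
        tokens = self.token_embedding(token_ids)  # text continuation
        hidden = self.decoder(torch.cat([prefix, tokens], dim=1))
        # Next-token prediction applies only to the text positions;
        # the visual prefix just conditions the generation.
        return self.lm_head(hidden[:, prefix.size(1):])

def freeze_language_model(model: PrefixLM) -> None:
    """Frozen PrefixLM variant: fix the LM, train only the image encoder."""
    for module in (model.token_embedding, model.decoder, model.lm_head):
        for p in module.parameters():
            p.requires_grad = False
```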

 

  3. Multi-modal Fusing with Cross-Attention


 

a)    Mechanism: Inject visual embeddings into LM layers via cross-attention (e.g., VisualGPT, Flamingo).

b)    Advantage: Balances the LM's text-generation ability with visual context, without requiring massive multimodal training datasets.
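
A simplified, Flamingo-flavored decoder block illustrating the mechanism (a sketch of the general pattern, not the actual VisualGPT or Flamingo code):

```python
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    """Decoder block where text hidden states attend to visual features."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        )

    def forward(self, text_hidden, visual_feats):
        # Usual LM self-attention over the text tokens.
        q = self.norm1(text_hidden)
        x = text_hidden + self.self_attn(q, q, q)[0]
        # Fusion step: text queries attend to the visual embeddings,
        # injecting image information into the language stream.
        x = x + self.cross_attn(self.norm2(x), visual_feats, visual_feats)[0]
        return x + self.ffn(self.norm3(x))
```

In Flamingo, for example, the inserted cross-attention layers are among the few newly trained components while the LM stays frozen, which is what keeps the data requirements modest.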

 

  4. Masked-Language Modeling (MLM) / Image-Text Matching (ITM)


a.     Tasks:

        i.     MLM: Predict masked text tokens using image context.

        ii.    ITM: Classify whether an image-text pair matches.

b.    Models: VisualBERT, FLAVA (combines MLM, ITM, and contrastive loss).
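
A hypothetical sketch of how the two objectives combine on top of a fused multimodal encoder (`mlm_head`, `itm_head`, and the tensor layout are assumptions for illustration, not VisualBERT's or FLAVA's API):

```python
import torch
import torch.nn.functional as F

def mlm_itm_losses(hidden, masked_positions, masked_labels, itm_labels,
                   mlm_head, itm_head):
    """Joint MLM + ITM objective on the output of a fused multimodal encoder.

    hidden:           (batch, seq, dim) hidden states over image+text tokens,
                      with position 0 acting as a [CLS]-style summary token.
    masked_positions: (batch, n_mask) indices of the masked text tokens.
    masked_labels:    (batch, n_mask) original ids of those tokens.
    itm_labels:       (batch,) 1 if the image-text pair matches, else 0.
    mlm_head:         maps (dim,) -> vocabulary logits.
    itm_head:         maps (dim,) -> 2-way match/mismatch logits.
    """
    # MLM: predict each masked token from its image-conditioned hidden state.
    batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
    masked_states = hidden[batch_idx, masked_positions]   # (batch, n_mask, dim)
    mlm_logits = mlm_head(masked_states)
    mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), masked_labels.flatten())

    # ITM: binary classification on the summary token.
    itm_loss = F.cross_entropy(itm_head(hidden[:, 0]), itm_labels)

    return mlm_loss + itm_loss
```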

 

  5. No Training


Approach: Use frozen pre-trained models with no additional training (e.g., ASIF matches images and text via similarity search over a small set of anchor image-text pairs; MaGiC guides iterative text generation with a CLIP-based score).
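
A simplified sketch of the ASIF idea: a query image and candidate captions are each re-represented by their similarities to a shared set of anchor image-text pairs, making the two embedding spaces comparable without any training (the real method additionally sparsifies and re-weights these relative vectors; all names here are illustrative):

```python
import torch.nn.functional as F

def asif_scores(image_emb, text_embs, anchor_img_embs, anchor_txt_embs):
    """Score candidate captions for one image using only frozen encoders.

    image_emb:        (dim,) embedding of the query image.
    text_embs:        (n_candidates, dim) embeddings of candidate captions.
    anchor_img_embs:  (n_anchors, dim) embeddings of anchor images.
    anchor_txt_embs:  (n_anchors, dim) embeddings of the anchors' captions.
    All embeddings come from frozen, independently pre-trained unimodal
    encoders and are assumed to be L2-normalized.
    """
    # Relative representations: similarity of each query to every anchor.
    # Because anchor i is an image-text PAIR, dimension i means the same
    # thing in both relative spaces, so the vectors become comparable.
    img_rel = image_emb @ anchor_img_embs.t()    # (n_anchors,)
    txt_rel = text_embs @ anchor_txt_embs.t()    # (n_candidates, n_anchors)
    return F.cosine_similarity(img_rel.unsqueeze(0), txt_rel, dim=-1)
```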

 

2. Model Characteristics

Key models and their features:

Common Architectural Traits:

3. Gaps & Limitations

4. Future Research Opportunities

  1. Expanded Modalities:
  2. Efficient Training:
  3. Domain-Specific Applications:
  4. Improved Alignment Strategies:
  5. Ethical & Robust Models:

5. Practical Takeaways