Visual Instruction Tuning
Published:
Paper: Visual Instruction Tuning (LLaVA, NeurIPS 2023)
Motivation
Large language models have demonstrated remarkable instruction-following capabilities, but these have been largely confined to the text domain. LLaVA asks: can we extend instruction tuning to the multimodal setting, creating a general-purpose visual assistant that can follow natural language instructions about images?
Method
The core idea is visual instruction tuning — generating instruction-following data from image-text pairs using GPT-4/ChatGPT, then fine-tuning a vision-language model on this data.
Data generation pipeline (sketched in code after the list):
- Start with image-caption pairs (from COCO)
- Feed the caption and image metadata (bounding boxes, object labels) to GPT-4
- Prompt GPT-4 to generate diverse conversation data: questions, answers, detailed descriptions, and reasoning traces
- This yields ~158K instruction-following samples spanning conversation, detailed description, and complex reasoning
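A minimal sketch of what this pipeline could look like, assuming a symbolic text rendering of the image (caption plus labeled boxes) and a placeholder `query_llm` callable standing in for GPT-4/ChatGPT. The prompt wording and annotation format here are illustrative assumptions, not the paper's released prompts.

```python
def build_context(caption, boxes):
    """Render the symbolic image representation (caption plus labeled bounding
    boxes) as plain text, since the generator LLM never sees the pixels."""
    box_lines = "\n".join(
        f"{b['label']}: [{b['x1']:.2f}, {b['y1']:.2f}, {b['x2']:.2f}, {b['y2']:.2f}]"
        for b in boxes
    )
    return f"Caption: {caption}\nObjects (normalized xyxy boxes):\n{box_lines}"

# One prompt template per data type (hypothetical wording).
TASK_PROMPTS = {
    "conversation": "Write a multi-turn Q&A between a user and an assistant about this image.",
    "detailed_description": "Describe the image in detail, as if to someone who cannot see it.",
    "complex_reasoning": "Ask a question that requires reasoning beyond what is literally stated, then answer it step by step.",
}

def generate_sample(caption, boxes, task, query_llm):
    """query_llm is a placeholder for a call to GPT-4/ChatGPT: it takes a
    prompt string and returns the model's text response."""
    prompt = f"{build_context(caption, boxes)}\n\n{TASK_PROMPTS[task]}\nUse only the information above."
    return {"task": task, "prompt": prompt, "response": query_llm(prompt)}
```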
Model architecture (a minimal sketch follows the list):
- Vision encoder: pre-trained CLIP ViT-L/14
- Language model: Vicuna (fine-tuned LLaMA)
- A single linear projection layer connects vision features to the language embedding space
- Only the projection layer is trained in stage 1; both the projection and LLM are fine-tuned in stage 2
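A minimal PyTorch sketch of the connector idea, assuming a frozen CLIP ViT-L/14 producing 1024-dimensional patch features and a 4096-dimensional LLM embedding space; the dimensions and module names are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Maps frozen CLIP patch features into the LLM's token embedding space
    with a single linear layer, as in LLaVA's original design."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

def assemble_inputs(visual_tokens: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend projected visual tokens to the text embedding sequence, so the
    LLM attends to image patches exactly like ordinary tokens."""
    return torch.cat([visual_tokens, text_embeddings], dim=1)
```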
Key Insights
Language-only instruction data is insufficient. Training only on text-based instruction data degrades multimodal performance; instruction data grounded in images is essential.
GPT-4 generated data works remarkably well. Despite being synthetic, the instruction-following data produces a capable assistant that can reason about images, answer follow-up questions, and even refuse to answer unanswerable queries.
Simple architecture, strong results. A single linear projection between CLIP and Vicuna is enough; no complex fusion mechanism is needed. The model reaches roughly an 85% relative score against GPT-4 on the paper's synthetic multimodal instruction-following benchmark while being fully open-source.
Two-stage training is important. Pre-training the projection layer on image-caption alignment before fine-tuning on instruction data prevents catastrophic forgetting and leads to better convergence.
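A sketch of how the two stages can be expressed as freeze/unfreeze steps; `vision_encoder`, `projector`, and `llm` are placeholder module names, not identifiers from the LLaVA codebase.

```python
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, vision_encoder, projector, llm):
    # The CLIP encoder stays frozen in both stages.
    set_trainable(vision_encoder, False)
    if stage == 1:
        # Stage 1: align vision features with the LLM embedding space on
        # image-caption pairs; only the projection layer is updated.
        set_trainable(projector, True)
        set_trainable(llm, False)
    else:
        # Stage 2: fine-tune the projection layer and the LLM on the
        # instruction-following data.
        set_trainable(projector, True)
        set_trainable(llm, True)
```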
Thoughts
LLaVA is an elegant demonstration that the instruction-tuning paradigm transfers well to multimodal settings. The data generation trick — using a strong LLM to convert existing image annotations into diverse instruction data — is cost-effective and scalable. The simplicity of the architecture is also a strength; it shows that with good data, even a linear projector suffices.
One limitation worth noting: the visual understanding is bottlenecked by the CLIP encoder, which has known failure modes (e.g., spatial reasoning, counting). Improving the vision backbone would likely yield significant gains.
