Vision-language Pre-training

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to improve the zero-shot performance of vision-language models (VLMs) without requiring labeled, task-aware data.
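To make the idea of a caption-correction task concrete, here is a minimal sketch of how one training sample might be built from an ordinary image-caption pair. The summary above does not say how mismatched captions are produced, so the corruption strategy (replace one word with a distractor, or swap two words) is an assumption made purely for illustration. The names `CorrectionSample`, `corrupt_caption`, and `make_sample` are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class CorrectionSample:
    image_id: str   # reference to the conditioning image
    corrupted: str  # model input: caption with one injected mismatch
    target: str     # model output: the original, correct caption


def corrupt_caption(caption: str, distractors: list[str], rng: random.Random) -> str:
    """Inject a single mismatch (assumed strategy, not the paper's)."""
    words = caption.split()
    # Option 1: replace one word with a different distractor word.
    if words and distractors and rng.random() < 0.5:
        i = rng.randrange(len(words))
        choices = [d for d in distractors if d != words[i]]
        if choices:
            words[i] = rng.choice(choices)
            return " ".join(words)
    # Option 2: swap two positions that hold different words.
    pairs = [
        (i, j)
        for i in range(len(words))
        for j in range(i + 1, len(words))
        if words[i] != words[j]
    ]
    if not pairs:
        return caption  # nothing to swap; caption is left unchanged
    i, j = rng.choice(pairs)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def make_sample(image_id: str, caption: str, distractors: list[str], seed: int = 0) -> CorrectionSample:
    """Pair a corrupted caption with its original as the correction target."""
    rng = random.Random(seed)
    return CorrectionSample(image_id, corrupt_caption(caption, distractors, rng), caption)


if __name__ == "__main__":
    sample = make_sample("img_001", "a dog chases a ball", ["cat", "frisbee"])
    print("input :", sample.corrupted)
    print("target:", sample.target)
```

Note that the target here is simply the original caption, so samples of this kind need only image-text pairs and no task-specific labels, in line with the claim above. A VLM would be given the image together with the corrupted caption and trained to generate the target.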