arxiv:2306.16410

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Published on Jun 28, 2023
· Submitted by AK on Jun 29, 2023

Abstract

AI-generated summary

LENS uses language models to reason over outputs from vision modules, achieving competitive performance in vision and vision-language tasks without multimodal training.

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
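For illustration, here is a minimal sketch of the LENS idea in Python, built from off-the-shelf Hugging Face models rather than the repository's actual API. The specific model names, prompt format, and choice of vision modules below are assumptions for demonstration only: descriptive vision modules turn the image into text (zero-shot tags via CLIP and a caption via BLIP), and a frozen, text-only LLM reasons over that text to answer a question, with no multimodal training.

```python
# Conceptual sketch of a LENS-style pipeline (not the ContextualAI/lens API):
# descriptive vision modules -> textual image description -> frozen LLM reasoning.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    BlipForConditionalGeneration, BlipProcessor,
    pipeline,
)

# Vision module 1: zero-shot tags from CLIP (the vocabulary is a placeholder).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Vision module 2: a free-form caption from BLIP.
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# Reasoning module: any off-the-shelf instruction-following LLM.
llm = pipeline("text2text-generation", model="google/flan-t5-base")

def lens_answer(image_path: str, question: str, vocabulary: list[str]) -> str:
    """Hypothetical helper: answer a question about an image using only text."""
    image = Image.open(image_path).convert("RGB")

    # Tags: rank a text vocabulary against the image with CLIP.
    inputs = clip_proc(text=vocabulary, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    top = probs.topk(min(5, len(vocabulary))).indices.tolist()
    tags = [vocabulary[i] for i in top]

    # Caption: generate a short description with BLIP.
    cap_inputs = blip_proc(image, return_tensors="pt")
    with torch.no_grad():
        caption_ids = blip.generate(**cap_inputs, max_new_tokens=30)
    caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

    # The LLM only ever sees text: the modules' outputs plus the question.
    prompt = (
        f"Tags: {', '.join(tags)}\n"
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        f"Short answer:"
    )
    return llm(prompt, max_new_tokens=20)[0]["generated_text"].strip()

# Example usage (paths and vocabulary are hypothetical):
# print(lens_answer("dog.jpg", "What animal is in the picture?",
#                   ["dog", "cat", "horse", "car", "tree"]))
```

Because the LLM consumes only the textual outputs of the vision modules, it can be swapped for any stronger off-the-shelf model without retraining, which is the modularity the paper emphasizes.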

Community

Hey, I'm reviewing deep learning papers on Twitter daily in Hebrew via the hashtag https://twitter.com/hashtag/shorthebrewpapereviews?src=hashtag_click. So far I've briefly reviewed a number of deep learning papers. You're invited to follow and comment.

This paper review can be found at: https://twitter.com/MikeE_3_14/status/1675088525237051394?s=20

There are actually a lot of large language vision models out there already.

Never mind, this is actually pretty novel.


LENS: The Future of Computer Vision with Language Models!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2306.16410 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2306.16410 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2306.16410 in a Space README.md to link it from this page.

Collections including this paper 1