LLMs powering Computer Vision tasks

💡 This is a community post by Prashant Kumar.

Computer vision turns visual data into valuable insights, and that is what makes it so exciting. Now, with Large Language Models (LLMs), combining vision and language opens up even more possibilities. Before we explore how LLMs are powering computer vision, let's take a quick look at how the two differ.

What are LLMs and Computer Vision models?


Large Language Models (LLMs) are AI systems trained on vast amounts of text, which enables them to understand and generate human language. Their billions of parameters let them handle tasks like conversation, language translation, and code assistance.

Computer Vision is a branch of AI that helps machines understand visual data, like images or videos. It involves tasks such as identifying objects (like recognizing a dog or cat), breaking images into specific parts (like pinpointing the exact pixels of a dog), and using this information to make decisions, such as guiding self-driving cars or diagnosing medical conditions.

What is Multimodal RAG?

Multimodal RAG (Retrieval Augmented Generation) combines different types of data, like text and images, during retrieval to help models give better answers. Instead of relying on text alone, the model can also look at relevant images to understand the question better.

For example, if you give the model a picture of a car and also show it examples of captions for cars, it can come up with more detailed and useful captions for that image. This method helps the model create more accurate and richer responses by combining different kinds of information.
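
To make that concrete, here is a minimal sketch of the retrieval half of this idea using CLIP embeddings and LanceDB. It is not the recipe code: the table name `caption_examples`, its `vector` and `caption` fields, and the query image path are illustrative assumptions.

```python
# Minimal multimodal-RAG retrieval sketch (illustrative, not the recipe code).
# Assumes a LanceDB table whose rows hold a CLIP image embedding ("vector")
# and a human-written caption ("caption").
import lancedb
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str):
    # Encode an image into a normalized CLIP embedding.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return (emb / emb.norm(dim=-1, keepdim=True)).squeeze().tolist()

db = lancedb.connect("./lancedb")
table = db.open_table("caption_examples")  # hypothetical table name

# Retrieve captions of the images most similar to the query photo...
query_vec = embed_image("my_car.jpg")  # placeholder image path
hits = table.search(query_vec).limit(3).to_list()

# ...and fold them into the prompt so the LLM sees both the image and examples.
example_captions = "\n".join(f"- {hit['caption']}" for hit in hits)
prompt = (
    "Write a caption for the attached photo of a car.\n"
    "Here are captions of similar photos for style reference:\n"
    f"{example_captions}"
)
```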

LanceDB's vectordb-recipes has a section dedicated to multimodal examples, showcasing use cases that leverage the multimodal capabilities of LLMs by combining different types of data, such as text and images, to solve a range of problems.

How are LLMs transforming the field of Computer Vision?

LLMs are powering the Computer Vision field in two major ways:
First, they orchestrate different models, such as vision and audio models, so they can work together smoothly. This helps streamline tasks and makes the whole pipeline more efficient.

Second, LLMs enhance the flexibility of vision tasks. For example, models like GPT-4 with vision can turn sketches into HTML code without extra fine-tuning, and they can also analyze high-resolution images with much greater detail.

These advancements are improving tasks like answering questions about images, finding objects in images, and classifying images, where LLMs break down complex problems into smaller steps and combine the results. As these models continue to evolve, they will make computer vision even more powerful and adaptable across a wide range of industries.

In LanceDB's vectordb-recipes, a few examples demonstrate these tasks. Let's explore them one by one.

Zero-Shot Object Localization and Detection with OpenAI's CLIP

Google Colab

This example shows how to perform object detection on images using CLIP and vector search. The process breaks down into three simple steps:

  1. First, the user will enter the name of the object they want to detect.
  2. Then, a vector search will be done to find images that match the query.
  3. Finally, the most similar image will be used to detect the object from the query.

Object detection with CLIP produces bounding boxes much like a traditional detector such as YOLO, but it arrives at them differently. Here's a simple breakdown:
1. Split the Image into Patches: The image is divided into smaller sections for easier analysis.
2. Analyze Patches with CLIP: CLIP scores each patch against the text query using a sliding window approach.
3. Calculate Coordinates: The best-matching patch determines the coordinates (Xmin, Ymin, Xmax, Ymax) of the object's bounding box.
4. Draw the Bounding Box: Finally, the bounding box is drawn on the image to highlight the detected object.

This process helps CLIP accurately detect and locate objects within an image.
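
Here is a rough sketch of what that sliding-window search could look like with the Hugging Face CLIP checkpoint. The window size, stride, and helper function are illustrative choices, not the exact values used in the recipe.

```python
# Sliding-window CLIP localization sketch (illustrative values, not the recipe code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def localize(image: Image.Image, query: str, window: int = 224, stride: int = 112):
    """Return the (xmin, ymin, xmax, ymax) of the patch most similar to the query."""
    # Embed the text query once and normalize it.
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    best_score, best_box = -1.0, None
    w, h = image.size
    # Slide a window over the image and score each patch against the query.
    for ymin in range(0, max(h - window, 1), stride):
        for xmin in range(0, max(w - window, 1), stride):
            box = (xmin, ymin, min(xmin + window, w), min(ymin + window, h))
            patch = image.crop(box)
            img_inputs = processor(images=patch, return_tensors="pt")
            with torch.no_grad():
                img_emb = model.get_image_features(**img_inputs)
                img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            score = (img_emb @ text_emb.T).item()
            if score > best_score:
                best_score, best_box = score, box
    return best_box, best_score
```

Repeating the search at a couple of window sizes usually gives tighter boxes, at the cost of more CLIP forward passes.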

Cambrian-1: Vision-centric exploration of images

https://www.kaggle.com/code/prasantdixit/cambrian-1-vision-centric-exploration-of-images/

This example explores images through a Vision-Centric approach using vector search. The process involves two simple steps:

  1. Performing Vector Search: We’ll first search for images that match the query.
  2. Vision-Centric Exploration: Then, we’ll use the retrieved images for further exploration and analysis.
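
As a rough sketch, the first step might look like the snippet below: the text query is embedded with CLIP and used to search a LanceDB table of image embeddings. The `images` table, its `image_path` field, and the query string are assumptions, and step 2 is only hinted at in a comment because running Cambrian-1 itself depends on its own tooling (see the notebook above).

```python
# Step 1 sketch: text-to-image vector search with CLIP + LanceDB (assumed schema).
import lancedb
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(query: str):
    # Encode the text query into a normalized CLIP embedding.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return (emb / emb.norm(dim=-1, keepdim=True)).squeeze().tolist()

db = lancedb.connect("./lancedb")
images = db.open_table("images")  # hypothetical table of image paths + CLIP vectors

# Step 1: find the image that best matches the query.
hits = images.search(embed_text("a red sports car at night")).limit(1).to_list()
best_image_path = hits[0]["image_path"]  # assumed field name

# Step 2 (not shown): load `best_image_path` and pass it to Cambrian-1
# together with a question for vision-centric exploration.
```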

Cambrian-1 is a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can boost multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research.

Read more about this example - https://blog.lancedb.com/cambrian-1-vision-centric-exploration/

Social Media Caption Generation using Llama 3.2 11B Vision

Google Colab

This example uses the Conceptual Captions dataset from Google Research. The image descriptions in the dataset are used to search for relevant images. Once a matching image is found, we treat it as a social media post and generate engaging, creative captions for it.
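
A minimal sketch of the captioning step is below, assuming access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint on Hugging Face. The prompt and image path are placeholders, and the recipe itself may load or serve the model differently.

```python
# Caption generation sketch with Llama 3.2 11B Vision via transformers
# (model id and prompt are assumptions; the checkpoint is gated on Hugging Face).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Image returned by the vector search over Conceptual Captions descriptions.
image = Image.open("retrieved_post_image.jpg")  # placeholder path

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Write a short, engaging social media caption for this image."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(output[0], skip_special_tokens=True))
```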


These are just a few examples of how Computer Vision and LLMs work together. LanceDB's vectordb-recipes has more examples that explore different ways of using LLMs with images to unlock new possibilities.

Conclusion

There's a lot of research on models that combine LLMs and computer vision, but the big breakthrough moment for computer vision with LLMs hasn't happened yet. While there's progress, we're still in the early stages, much like where LLMs were before ChatGPT took off.