Qwen2-VL, an advanced vision language model built on Qwen2 [1], sets new benchmarks in image comprehension across varied resolutions and ratios, while also tackling extended video content.
While Qwen2-VL excels on many fronts, this article focuses on the model’s innovative features and its potential applications for video understanding and Q&A.
🔥 TL;DR 🔥
- We set up a pipeline to query a custom video using Qwen2-VL and shared the code for easy, quick setup.
- Qwen2-VL achieves state-of-the-art results on image understanding benchmarks (MathVista, DocVQA, RealWorldQA).
- It can process videos longer than 20 minutes.
- Multilingual support includes English, Chinese, Japanese, Korean, Arabic, and most European languages for text recognition in images.
- Claude 3.5 Sonnet and GPT-4o can process images (i.e., frames) but, unlike Qwen2-VL, cannot ingest video directly.
Learn about the cutting edge of multimodality and foundation models in our CVPR 2024 series:
- Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more).
- Top Highlights You Must Know — Embodied AI, GenAI, Foundation Models, and Video Understanding.
⭐️ Don’t miss our Segment Anything Model (SAM 2) series:
Table of Contents
1. Introduction to Qwen2-VL and Vision Language Models
2. Qwen2-VL for Video Understanding and Q&A
3. What’s next?
1. Introduction to Qwen2-VL and Vision Language Models
1.1 Brief overview of Vision Language Models
Vision Language Models (VLMs) bridge the gap between visual and textual information. Unlike traditional models trained for specific tasks, VLMs are designed for versatility across various vision-language applications (see Figure 1).
VLMs are trained on massive datasets of image-text pairs to learn the correspondence between visual and textual elements. They employ different learning strategies:
- Contrastive Learning: Distinguish between matching (positive) and mismatched (negative) image-text pairs. Example: CLIP [2].
- Masking Objectives: Predict masked visual or textual elements based on the visible context. Example: FLAVA [3], MaskVLM [4].
- Generative Modeling: Generate images from text descriptions or vice versa. Example: CoCa [5], CM3leon [6].
- Pretrained Backbones: Utilize pretrained large language models (LLMs) like Llama to map image features to language representations. Example: MiniGPT-4 [7].
VLM evaluation, in turn, relies on benchmarks that assess the ability to connect visual and textual information, such as:
- Vision-linguistic benchmarks: measure the accuracy of image captioning, text-to-image consistency, visual question answering, zero-shot image classification, and visual reasoning.
- Hallucination benchmarks: evaluate the tendency of VLMs to generate incorrect or irrelevant text based on images.
1.2 Qwen2-VL: Purpose and Key Features
Qwen2-VL was built on the foundation of Qwen2 language models and aims to achieve state-of-the-art performance in understanding and interacting with both images and videos.
Qwen2-VL’s three primary features are:
- Advanced Image and Video Understanding: Qwen2-VL excels at comprehending images and videos. It can analyze images of varying resolutions and ratios, exceeding previous models’ limitations. It can understand videos exceeding 20 minutes, enabling it to answer complex questions.
- Multilingual Proficiency: Qwen2-VL supports understanding text within images in multiple languages. It’s capable of comprehending English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, and more.
- Multimodal Agent Capabilities: Qwen2-VL goes beyond passive understanding to become an active agent, capable of interacting with the world through visual cues and instructions:
- Function Calling: This model can use external tools for real-time data retrieval, responding to user queries by extracting information from visual sources like analyzing flight statuses in an image or checking weather forecasts based on a picture. 🛫 🌧
- Visual Interactions: Qwen2-VL takes a step towards mimicking human-like perception, allowing interaction with visual stimuli in a manner similar to how humans perceive the world. This opens up possibilities for more intuitive and immersive interactions, where the model can participate actively in visual experiences. 👓 🖼
“Function calling refers to the capability to define and describe calls to external application programming interfaces (APIs).” (Microsoft)
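To make this concrete, here is a rough sketch of how a callable tool is typically described to a model. The schema and the `get_weather` name below are purely illustrative and are not Qwen2-VL’s exact function-calling format:

```python
# Illustrative only: a JSON-style description of an external tool a VLM agent
# could call. This is NOT Qwen2-VL's exact function-calling schema.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a location identified in an image.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name extracted from the image."},
        },
        "required": ["city"],
    },
}
# The model decides when to emit a call such as get_weather(city="Paris");
# your code executes it and feeds the result back as context for the final answer.
```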
1.3 How Qwen2-VL Stands Out in the Current AI Landscape
In previous posts we compared multimodal models side by side (i.e., using both simple and challenging prompts for image understanding). Here, we’ll simply highlight where Qwen2-VL excels.
Figure 2 shows how Qwen2-VL distinguishes itself from GPT-4o and Claude 3.5 Sonnet, mostly through its visual understanding of both video and high-resolution images.
2. Qwen2-VL for Video Understanding and Q&A
Note: be aware that Colab’s free tier might only work for the Qwen2-VL 2B model; Qwen2-VL 7B will exhaust Colab’s free GPU memory.
💻 You can run the setup and do inference in no time with this Jupyter Notebook we have created for you. 😎
🤖 We’ll use the video above to ask questions such as:
- In which city or country is the event happening?
- What is the outfit’s colour of the gymnast?
2.1 How to set up Qwen2-VL
To set up Qwen2-VL you need:
- A specific version of the transformers library
- A utilities library to interact with Qwen2-VL
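For reference, the installs look roughly like this in a Colab/Jupyter cell. At the time of writing, Qwen2-VL support required a recent (or source) build of transformers, plus the qwen-vl-utils helper package; check the model card for the exact versions:

```python
# Run in a Colab/Jupyter cell (a sketch; pin versions as the model card recommends)
!pip install git+https://github.com/huggingface/transformers  # recent build with Qwen2-VL support
!pip install qwen-vl-utils                                     # helpers for image/video inputs
!pip install accelerate                                        # optional: device placement (device_map="auto")
```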
Qwen2-VL can run inference either by analyzing extracted frames of a video or by ingesting the entire video itself. For the frame-based approach, we need a library to extract frames from the video.
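The original notebook may use a different tool, but a minimal frame sampler with OpenCV looks like this:

```python
import cv2                 # pip install opencv-python
from PIL import Image      # frames as PIL images, which the Qwen2-VL processor accepts

def extract_frames(video_path, every_n_frames=30, max_frames=20):
    """Sample up to `max_frames` frames, keeping one frame every `every_n_frames`."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while cap.isOpened() and len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes to BGR; convert to RGB before building a PIL image
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames
```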
We load the model, in our case Qwen2-VL 2B. You might also want to check Qwen2-VL 7B and the 72B option (what!), which is available through API only.
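A minimal loading sketch; half precision and `device_map="auto"` are one reasonable choice for a free Colab GPU:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # swap for "Qwen/Qwen2-VL-7B-Instruct" if you have the GPU memory

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit the free Colab GPU
    device_map="auto",          # requires the accelerate package
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```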
Finally, the following function wraps a call to perform inference.
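A sketch of such a wrapper, following the usage pattern from the Qwen2-VL model card (the function name and defaults below are ours):

```python
from qwen_vl_utils import process_vision_info

def run_inference(messages, max_new_tokens=128):
    # Build the chat prompt and gather the image/video inputs referenced in the messages
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    # Generate, then strip the prompt tokens before decoding the answer
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```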
2.2 Video understanding (using frames)
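To query the video via frames, we reuse the helpers above; a sketch (the video filename is a placeholder):

```python
frames = extract_frames("gymnastics_clip.mp4")  # placeholder filename

messages = [
    {
        "role": "user",
        "content": [
            # One image entry per sampled frame, followed by the question
            *[{"type": "image", "image": frame} for frame in frames],
            {"type": "text", "text": "In which city or country is the event happening?"},
        ],
    }
]

print(run_inference(messages))
```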
Figure 4 illustrates the answers of Qwen2-VL using frames.
Other questions asked were:
- User: What is the outfit’s colour of the gymnast?
- Qwen2-VL: “The gymnast is wearing a white outfit with red and black stripes.”
- User: In which city or country is the event happening?
- Qwen2-VL: “The event is taking place in Paris, France.”
The responses appear to be pretty accurate, especially considering that we’re using the 2B parameter model option.
Despite being an apples-to-oranges comparison 🍎 🍊, what happens if we ask GPT-4o or Claude 3.5 Sonnet the same questions?
In short, as Figure 5 shows, Claude 3.5 Sonnet seems to be as accurate as Qwen2-VL for a single frame.
However, what if you want to ask specific questions about things that happen in the middle or at the end of the video? 🤔 Well, here’s where Qwen2-VL shines: you simply ask the question! GPT-4o and Claude 3.5 Sonnet are unable to ingest/process video (at least not yet). 📹
Google has a couple of early-stage solutions for video understanding, but we’ll leave them for a different blog post.
2.3 Video understanding (using the entire video)
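Switching to whole-video input only changes the message payload; the path below is a placeholder, and `fps`/`max_pixels` are optional knobs handled by qwen-vl-utils:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/gymnastics_clip.mp4",  # placeholder path
                "fps": 1.0,               # optional: frame sampling rate
                "max_pixels": 360 * 420,  # optional: cap per-frame resolution to save memory
            },
            {"type": "text", "text": "What is the outfit's colour of the gymnast?"},
        ],
    }
]

print(run_inference(messages))
```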
As shown in Figure 6, Qwen2-VL is, unsurprisingly, also quite accurate. It provides even more detail than the frame-level version: since the model’s input is now the entire video, it has more information with which to produce a detailed response.
3. What’s next?
In this article we set up Qwen2-VL, a Vision Language Model that can be used for video understanding.
Video search and understanding is an exciting area that is ripe for disruption. As we pointed out in our series Scalable Video Search: Cascading Foundation Models — Part 1, video is everywhere but the infrastructure to do robust search at scale is still under construction.
References
[1] Qwen2 Technical Report
[2] Learning Transferable Visual Models From Natural Language Supervision
[3] FLAVA: A Foundational Language And Vision Alignment Model
[4] Masked Vision and Language Modeling for Multi-modal Representation Learning
[5] CoCa: Contrastive Captioners are Image-Text Foundation Models
[6] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
[7] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Authors: Dmitry Kazhdan, Jose Gabriel Islas Montero
If you’d like to know more about Tenyks, try sandbox.