Qwen2-VL, an advanced vision language model built on Qwen2 [1], sets new benchmarks in image comprehension across varied resolutions and ratios, while also tackling extended video content.
While Qwen2-VL excels on many fronts, this article focuses on the model’s innovative features and its potential applications for video understanding and Q&A.
🔥 TL;DR 🔥
- We set up a pipeline to query a custom video using Qwen2-VL and shared the code for easy, quick setup.
- Qwen2-VL achieves state-of-the-art results on image understanding benchmarks (MathVista, DocVQA, RealWorldQA).
- It can process videos longer than 20 minutes.
- Multilingual support includes English, Chinese, Japanese, Korean, Arabic, and most European languages for text recognition in images.
- Claude 3.5 Sonnet and GPT-4o can process images (i.e., frames) but, unlike Qwen2-VL, cannot ingest video directly.
Learn about the cutting edge of multimodality and foundation models in our CVPR 2024 series:
- Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more).
- Top Highlights You Must Know — Embodied AI, GenAI, Foundation Models, and Video Understanding.
⭐️ Don’t miss our Segment Anything Model (SAM 2) series:
Table of Contents
1. Introduction to Qwen2-VL and Vision Language Models
2. Qwen2-VL for Video Understanding and Q&A
3. What’s next?
1. Introduction to Qwen2-VL and Vision Language Models
1.1 Brief overview of Vision Language Models
Vision Language Models (VLMs) bridge the gap between visual and textual information. Unlike traditional models trained for specific tasks, VLMs are designed for versatility across various vision-language applications (see Figure 1).
VLMs are trained on massive datasets of image-text pairs to learn the correspondence between visual and textual elements. They employ different learning strategies:
- Contrastive Learning: Distinguish between matching (positive) and mismatched (negative) image-text pairs. Example: CLIP [2].
- Masking Objectives: Predict masked visual or textual elements based on the visible context. Example: FLAVA [3], MaskVLM [4].
- Generative Modeling: Generate images from text descriptions or vice versa. Example: CoCa [5], CM3leon [6].
- Pretrained Backbones: Utilize pretrained large language models (LLMs) like Llama to map image features to language representations. Example: MiniGPT-4 [7].
VLM evaluation, in turn, relies on benchmarks that assess the ability to connect visual and textual information, such as:
- Vision-linguistic benchmarks: measure the accuracy of image captioning, text-to-image consistency, visual question answering, zero-shot image classification, and visual reasoning.
- Hallucination benchmarks: evaluate the tendency of VLMs to generate incorrect or irrelevant text based on images.
1.2 Qwen2-VL: Purpose and Key Features
Qwen2-VL was built on the foundation of Qwen2 language models and aims to achieve state-of-the-art performance in understanding and interacting with both images and videos.
Qwen2-VL’s three primary features are:
- Advanced Image and Video Understanding: Qwen2-VL excels at comprehending images and videos. It can analyze images of varying resolutions and ratios, exceeding previous models’ limitations. It can understand videos exceeding 20 minutes, enabling it to answer complex questions.
- Multilingual Proficiency: Qwen2-VL supports understanding text within images in multiple languages. It’s capable of comprehending English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, and more.
- Multimodal Agent Capabilities: Qwen2-VL goes beyond passive understanding to become an active agent, capable of interacting with the world through visual cues and instructions:
- Function Calling: This model can use external tools for real-time data retrieval, responding to user queries by extracting information from visual sources like analyzing flight statuses in an image or checking weather forecasts based on a picture. 🛫 🌧
- Visual Interactions: Qwen2-VL takes a step towards mimicking human-like perception, allowing interaction with visual stimuli in a manner similar to how humans perceive the world. This opens up possibilities for more intuitive and immersive interactions, where the model can participate actively in visual experiences. 👓 🖼
“Function calling refers to the capability to define and describe calls to external application programming interfaces (APIs).” (Microsoft)
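To make this concrete, here is a rough sketch of how a callable tool is typically described to a model. The schema and the `get_weather` name below are purely illustrative and are not Qwen2-VL’s exact function-calling format:

```python
# Illustrative only: a JSON-style description of an external tool a VLM agent
# could call. This is NOT Qwen2-VL's exact function-calling schema.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a location identified in an image.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name extracted from the image."},
        },
        "required": ["city"],
    },
}
# The model decides when to emit a call such as get_weather(city="Paris");
# your code executes it and feeds the result back as context for the final answer.
```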
1.3 How Qwen2-VL Stands Out in the Current AI Landscape
In previous posts we compared multimodal models side by side (i.e., using both simple and challenging prompts for image understanding). Here, we’ll simply highlight where Qwen2-VL excels.
Figure 2 shows how Qwen2-VL distinguishes itself from GPT-4o and Claude 3.5 Sonnet, mostly through its visual understanding of both video and high-resolution images.
2. Qwen2-VL for Video Understanding and Q&A
Note: be aware that Colab’s free tier might only work for the Qwen2-VL 2B model; Qwen2-VL 7B will exhaust Colab’s free GPU memory.
💻 You can run the setup and do inference in no time with this Jupyter Notebook we have created for you. 😎
🤖 We’ll use the video above to ask questions such as:
- In which city or country is the event happening?
- What is the outfit’s colour of the gymnast?
2.1 How to set up Qwen2-VL
To set up Qwen2-VL you need:
- A specific version of the transformers library
- A utilities library to interact with Qwen2-VL
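For reference, the installs look roughly like this in a Colab/Jupyter cell. At the time of writing, Qwen2-VL support required a recent (or source) build of transformers, plus the qwen-vl-utils helper package; check the model card for the exact versions:

```python
# Run in a Colab/Jupyter cell (a sketch; pin versions as the model card recommends)
!pip install git+https://github.com/huggingface/transformers  # recent build with Qwen2-VL support
!pip install qwen-vl-utils                                     # helpers for image/video inputs
!pip install accelerate                                        # optional: device placement (device_map="auto")
```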
Qwen2-VL can run inference either by analyzing extracted frames of a video or by ingesting the entire video itself. For the frame-based approach, we need a library to extract frames from the video.
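The original notebook may use a different tool, but a minimal frame sampler with OpenCV looks like this:

```python
import cv2                 # pip install opencv-python
from PIL import Image      # frames as PIL images, which the Qwen2-VL processor accepts

def extract_frames(video_path, every_n_frames=30, max_frames=20):
    """Sample up to `max_frames` frames, keeping one frame every `every_n_frames`."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while cap.isOpened() and len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes to BGR; convert to RGB before building a PIL image
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames
```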
We load the model, in our case Qwen2-VL 2B. You might also want to check Qwen2-VL 7B and the 72B option (what!), which is available through API only.
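A minimal loading sketch; half precision and `device_map="auto"` are one reasonable choice for a free Colab GPU:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # swap for "Qwen/Qwen2-VL-7B-Instruct" if you have the GPU memory

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit the free Colab GPU
    device_map="auto",          # requires the accelerate package
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```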
Finally, the following function wraps a call to perform inference.
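A sketch of such a wrapper, following the usage pattern from the Qwen2-VL model card (the function name and defaults below are ours):

```python
from qwen_vl_utils import process_vision_info

def run_inference(messages, max_new_tokens=128):
    # Build the chat prompt and gather the image/video inputs referenced in the messages
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    # Generate, then strip the prompt tokens before decoding the answer
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```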
2.2 Video understanding (using frames)
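To query the video via frames, we reuse the helpers above; a sketch (the video filename is a placeholder):

```python
frames = extract_frames("gymnastics_clip.mp4")  # placeholder filename

messages = [
    {
        "role": "user",
        "content": [
            # One image entry per sampled frame, followed by the question
            *[{"type": "image", "image": frame} for frame in frames],
            {"type": "text", "text": "In which city or country is the event happening?"},
        ],
    }
]

print(run_inference(messages))
```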
Figure 4 illustrates the answers of Qwen2-VL using frames.
Other questions asked were:
- User: What is the outfit’s colour of the gymnast?
- Qwen2-VL: “The gymnast is wearing a white outfit with red and black stripes.”
- User: In which city or country is the event happening?
- Qwen2-VL: “The event is taking place in Paris, France.”
The responses appear to be pretty accurate, especially considering that we’re using the 2B parameter model option.
Despite being an apples-to-oranges comparison 🍎 🍊, what happens if we ask GPT-4o or Claude 3.5 Sonnet the same questions?
In short, as Figure 5 shows, Claude 3.5 Sonnet seems to be as accurate as Qwen2-VL for a single frame.
However, what if you want to ask specific questions about things that happen in the middle or at the end of the video? 🤔 Well, here’s where Qwen2-VL shines: you simply ask the question! GPT-4o and Claude 3.5 Sonnet are unable to ingest/process video (at least not yet). 📹
Google has a couple of early-stage solutions for video understanding, but we’ll leave them for a different blog post.
2.3 Video understanding (using the entire video)
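Switching to whole-video input only changes the message payload; the path below is a placeholder, and `fps`/`max_pixels` are optional knobs handled by qwen-vl-utils:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/gymnastics_clip.mp4",  # placeholder path
                "fps": 1.0,               # optional: frame sampling rate
                "max_pixels": 360 * 420,  # optional: cap per-frame resolution to save memory
            },
            {"type": "text", "text": "What is the outfit's colour of the gymnast?"},
        ],
    }
]

print(run_inference(messages))
```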
As shown in Figure 6, Qwen2-VL is, unsurprisingly, also quite accurate. It provides even more detail than the frame-level version: since the model’s input is now the entire video, it has more information with which to produce a detailed response.
3. What’s next?
In this article we set up Qwen2-VL, a Vision Language Model that can be used for video understanding.
Video search and understanding is an exciting area that is ripe for disruption. As we pointed out in our series Scalable Video Search: Cascading Foundation Models — Part 1, video is everywhere but the infrastructure to do robust search at scale is still under construction.
References
[1] Qwen2 Technical Report
[2] Learning Transferable Visual Models From Natural Language Supervision
[3] FLAVA: A Foundational Language And Vision Alignment Model
[4] Masked Vision and Language Modeling for Multi-modal Representation Learning
[5] CoCa: Contrastive Captioners are Image-Text Foundation Models
[6] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
[7] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Authors: Dmitry Kazhdan, Jose Gabriel Islas Montero
If you’d like to know more about Tenyks, try sandbox.