Video Understanding: Qwen2-VL – An Expert Vision-Language Model

September 24, 2024
6 min read

Qwen2-VL, an advanced vision-language model built on Qwen2 [1], sets new benchmarks in image comprehension across varied resolutions and aspect ratios, while also tackling extended video content.

Though Qwen2-VL excels on many fronts, this article focuses on the model’s innovative features and its potential applications for video understanding and Q&A.

🔥 TL;DR 🔥

  • We set up a pipeline to query a custom video using Qwen2-VL and shared the code for easy, quick setup.
  • Qwen2-VL achieves state-of-the-art results on image understanding benchmarks (MathVista, DocVQA, RealWorldQA).
  • It can process videos longer than 20 minutes.
  • Multilingual support includes English, Chinese, Japanese, Korean, Arabic, and most European languages for text recognition in images.
  • Claude 3.5 Sonnet and GPT-4o can process images (i.e., frames) but cannot process videos like Qwen2-VL.

Learn about the cutting edge of multimodality and foundation models in our CVPR 2024 series.

⭐️ Don’t miss our Segment Anything Model (SAM 2) series:

  • SAM 2 + GPT-4o Cascading Foundation Models via Visual Prompting: Part 1 and Part 2.

Table of Contents

1. Introduction to Qwen2-VL and Vision Language Models

2. Qwen2-VL for Video Understanding and Q&A

3. What’s next?

1. Introduction to Qwen2-VL and Vision Language Models

1.1 Brief overview of Vision Language Models

Vision Language Models (VLMs) bridge the gap between visual and textual information. Unlike traditional models trained for specific tasks, VLMs are designed for versatility across various vision-language applications (see Figure 1).

Figure 1. The strong visual component of VLMs makes them ideal for practical vision-related tasks

VLMs are trained on massive datasets of image-text pairs to learn the correspondence between visual and textual elements. They employ different learning strategies:

  • Contrastive Learning: Distinguish between matching (positive) and mismatched (negative) image-text pairs (see the sketch after this list). Example: CLIP [2].
  • Masking Objectives: Predict masked visual or textual elements based on the visible context. Example: FLAVA [3], MaskVLM [4].
  • Generative Modeling: Generate images from text descriptions or vice versa. Example: CoCa [5], CM3leon [6].
  • Pretrained Backbones: Utilize pre-trained large language models (LLMs) like Llama to map image features to language representations. Example: MiniGPT-4 [7].
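
To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style loss: a symmetric cross-entropy over the image-text similarity matrix. The encoders are omitted and random embeddings stand in for their outputs, so this illustrates the objective rather than CLIP’s actual implementation.

import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j
    logits = image_embeds @ text_embeds.T / temperature

    # Matching (positive) pairs sit on the diagonal
    targets = torch.arange(logits.size(0))

    # Symmetric cross-entropy over image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage: random embeddings stand in for encoder outputs
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())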

VLM evaluation, in turn, relies on benchmarks that assess a model’s ability to connect visual and textual information, such as:

  • Vision-linguistic benchmarks measure the accuracy of image captioning, text-to-image consistency, visual question answering, zero-shot image classification, and visual reasoning.
  • Hallucination benchmarks evaluate the tendency of VLMs to generate incorrect or irrelevant text based on images.

1.2 Qwen2-VL: Purpose and Key Features

Qwen2-VL was built on the foundation of Qwen2 language models and aims to achieve state-of-the-art performance in understanding and interacting with both images and videos.

Qwen2-VL’s three primary features are:

  1. Advanced Image and Video Understanding: Qwen2-VL excels at comprehending images and videos. It can analyze images of varying resolutions and aspect ratios, exceeding previous models’ limitations, and it can understand videos longer than 20 minutes, enabling it to answer complex questions about their content.
  2. Multilingual Proficiency: Qwen2-VL supports understanding text within images in multiple languages. It’s capable of comprehending English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, and more.
  3. Multimodal Agent Capabilities: Qwen2-VL goes beyond passive understanding to become an active agent, capable of interacting with the world through visual cues and instructions:
  • Function Calling: The model can use external tools for real-time data retrieval, responding to user queries by extracting information from visual sources, such as reading a flight status or checking a weather forecast from an image. 🛫 🌧
  • Visual Interactions: Qwen2-VL takes a step towards mimicking human-like perception, allowing interaction with visual stimuli in a manner similar to how humans perceive the world. This opens up possibilities for more intuitive and immersive interactions, where the model can participate actively in visual experiences. 👓 🖼

“Function calling refers to the capability to define and describe calls to external application programming interfaces (APIs).” (Microsoft)
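
To illustrate what such a function description can look like, here is a minimal, model-agnostic sketch; the `get_flight_status` tool, its schema, and the stub implementation are hypothetical examples, not part of Qwen2-VL’s actual API.

# Hypothetical tool description a model could be asked to call
# after reading a flight number off an image.
get_flight_status_tool = {
    "name": "get_flight_status",  # hypothetical function name
    "description": "Look up the live status of a flight.",
    "parameters": {
        "type": "object",
        "properties": {
            "flight_number": {"type": "string", "description": "e.g. BA123"},
        },
        "required": ["flight_number"],
    },
}

def get_flight_status(flight_number: str) -> str:
    # Stand-in for a real call to a flight-tracking API
    return f"Flight {flight_number} is on time."

# If the model returns a structured call such as
# {"name": "get_flight_status", "arguments": {"flight_number": "BA123"}},
# the application executes it and feeds the result back to the model.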

1.3 How Qwen2-VL Stands Out in the Current AI Landscape

In previous posts we compared multimodal models side by side (i.e., using both simple and challenging prompts for image understanding). Here, we’ll simply highlight where Qwen2-VL excels.

Figure 2 shows how Qwen2-VL distinguishes itself from GPT-4o and Claude 3.5 Sonnet, mainly in its visual understanding of both video and high-resolution images.

Figure 2. Comparison between Qwen2-VL and other frontier models

2. Qwen2-VL for Video Understanding and Q&A

Note: Colab’s free tier will likely only work with the Qwen2-VL 2B model; Qwen2-VL 7B will exhaust the free GPU memory.

💻 You can run the setup and do inference in no time with this Jupyter Notebook we have created for you. 😎

🤖 We’ll use the video above to ask questions such as:

  • In which city or country is the event happening?
  • What is the outfit’s colour of the gymnast?

2.1 How to set up Qwen2-VL

To set up Qwen2-VL you need:

  • A specific version of the transformers library
  • A utilities library to interact with Qwen2-VL

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate

pip install qwen-vl-utils

Next, Qwen2-VL can do inference either by analyzing sampled frames of a video or by ingesting the entire video itself. For the frame-based route, we need a tool to extract frames from the video.


# For Colab Notebooks (the ffmpeg binary is usually preinstalled; if missing, install it via apt)
apt-get install -y ffmpeg

# For a Jupyter Notebook on your own GPU with a conda environment
conda install -c conda-forge ffmpeg -y
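
The notebook’s frame-extraction helpers are not shown in this post, so here is a minimal sketch of what they might look like, assuming ffmpeg is on the path: `extract_frames` samples one frame per second, and `get_frame_list` (the helper name used by `query_video` below) returns the frame paths in order. The notebook’s actual implementation may differ.

import os
import glob
import subprocess

def extract_frames(video_path, output_path="/home/qwen2_vl/content/frames", fps=1):
    # Sample `fps` frames per second from the video into numbered JPEGs
    os.makedirs(output_path, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(output_path, "frame_%04d.jpg")],
        check=True,
    )

def get_frame_list(frames_path):
    # Return the extracted frames in chronological order
    return sorted(glob.glob(os.path.join(frames_path, "*.jpg")))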

We load the model, in our case Qwen2-VL 2B. You might also want to check Qwen2-VL 7B, as well as a 72B option (what!) that is available through API only.


import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the 2B instruct checkpoint; device_map="auto" places it on the available GPU
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# The processor handles chat templating plus image/video preprocessing
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
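
If you have more GPU memory available, the larger variants load the same way; only the checkpoint name changes. For example, the 7B instruct model:

# Needs more GPU memory than Colab's free tier typically provides
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)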

Finally, the following function wraps the inference call, letting us query the video (via frames or as a whole) with a single prompt.


def query_video(prompt, use_frames=True, frames_path="/home/qwen2_vl/content/frames", video_path=None):
    if use_frames:
        # Get the list of extracted frames (see get_frame_list above)
        selected_frames = get_frame_list(frames_path)

        # Create messages structure for frames
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "video",
                        "video": selected_frames,
                        "fps": 1.0,
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ]
    else:
        # Create messages structure for the entire video
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "video",
                        "video": f"file://{video_path}",
                        "max_pixels": 360 * 420,
                        "fps": 1.0,
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ]

    print(f"Using {'frames' if use_frames else 'entire video'} for inference.")

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Inference
    with torch.no_grad():  # Use no_grad to save memory during inference
        generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Trim the generated output to remove the input prompt
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Decode the generated text
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    print(output_text)
    torch.cuda.empty_cache()
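
With everything in place, each question is a single call. The paths below are illustrative; point them at wherever you extracted the frames or saved the video.

# Frame-based inference (uses the frames extracted earlier)
query_video(
    "In which city or country is the event happening?",
    use_frames=True,
    frames_path="/home/qwen2_vl/content/frames",
)

# Entire-video inference (video path is illustrative)
query_video(
    "Describe the video in detail",
    use_frames=False,
    video_path="/home/qwen2_vl/content/video.mp4",
)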

2.2 Video understanding (using frames)

Figure 4 illustrates Qwen2-VL’s answer when using frames.

Figure 4. Qwen2-VL (frame-level mode) was queried with “Describe the video in detail”

Other questions asked were:

  • User: What is the outfit’s colour of the gymnast?
  • Qwen2-VL: “The gymnast is wearing a white outfit with red and black stripes.”
  • User: In which city or country is the event happening?
  • Qwen2-VL: “The event is taking place in Paris, France.”

The responses appear to be pretty accurate, especially considering that we’re using the 2B parameter model option.

Despite being an apples-to-oranges comparison 🍎 🍊, what happens if we ask GPT-4o or Claude 3.5 Sonnet the same questions?

Figure 5. Claude 3.5 Sonnet provided a response similar to Qwen2-VL (frame-level).

In short, as Figure 5 shows, Claude 3.5 Sonnet seems to be as accurate as Qwen2-VL for a single frame.

However, what if you want to ask about something that happens in the middle or at the end of the video? 🤔 Well, here’s where Qwen2-VL shines: you simply ask the question! GPT-4o and Claude 3.5 Sonnet are unable to ingest or process video (at least not yet). 📹

Google has a couple of early-stage solutions for video understanding, but we’ll leave them for a different blog post.

2.3 Video understanding (using the entire video)

As shown in Figure 6, Qwen2-VL is, unsurprisingly, also quite accurate. It provides even more detail than the frame-level version: since the model’s input is now the entire video, it has more information to draw on for a detailed response.

Figure 6. Qwen2-VL (entire video mode) provided a more detailed answer than the frame-level alternative when queried with “Describe the video in detail”.

3. What’s next?

In this article we set up Qwen2-VL, a vision-language model that can be used for video understanding and Q&A.

Video search and understanding is an exciting area that is ripe for disruption. As we pointed out in our series Scalable Video Search: Cascading Foundation Models — Part 1, video is everywhere but the infrastructure to do robust search at scale is still under construction.

References

[1] Qwen2 Technical Report

[2] Learning Transferable Visual Models From Natural Language Supervision

[3] FLAVA: A Foundational Language And Vision Alignment Model

[4] Masked Vision and Language Modeling for Multi-modal Representation Learning

[5] CoCa: Contrastive Captioners are Image-Text Foundation Models

[6] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

[7] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Authors: Dmitry Kazhdan, Jose Gabriel Islas Montero

If you’d like to know more about Tenyks, try sandbox.
