In Part 1 of this article, we introduce Segment Anything Model 2 (SAM 2). Then, we walk you through how to set it up and run inference on your own video clips.
🔥 Learn more about visual prompting and RAG:
- CVPR 2024: Foundation Models + Visual Prompting Are About to Disrupt Computer Vision
- RAG for Vision: Building Multimodal Computer Vision Systems
Table of Contents
- What is Segment Anything Model 2 (SAM 2)?
- What is special about SAM 2?
- How can I run SAM 2?
- What’s next
1. What is Segment Anything Model 2 (SAM 2)?
TL;DR:
SAM 2 can segment objects in any image or video without retraining.
Segment Anything Model 2 (SAM 2) [1] by Meta is an advanced version of the original Segment Anything Model [2] designed for object segmentation in both images and videos (see Figure 1).
Figure 1. A pedestrian (blue mask) and a car (yellow mask) are segmented and tracked using SAM 2
Released under an open-source Apache 2.0 license, SAM 2 represents a significant leap forward in computer vision, allowing for real-time, promptable segmentation of objects.
SAM 2 is notable for its accuracy in image segmentation and its superior performance in video segmentation, requiring significantly less interaction time than previous models: later in this article, we show how SAM 2 needed only three points to segment an object across an entire video!
Alongside SAM 2, Meta has also introduced the SA-V dataset, which features over 51,000 videos and more than 600,000 masklets. This dataset facilitates the model's application in diverse fields such as medical imaging, satellite imagery, marine science, and content creation.
1.1 SAM 2 features summary
The main characteristics of SAM 2 are summarized in Figure 2.
2. What is special about SAM 2?
What’s novel about SAM 2 is that it addresses the complexities of video data, such as object motion, deformation, occlusion, and lighting changes, which are not present in static images.
This makes SAM 2 a crucial tool for applications in mixed reality, robotics, autonomous vehicles, and video editing.
Figure 3. SAM 2 in action: the ball is removed from the original video (top left), and a new video with no ball is created (bottom right) (Source)
SAM 2’s key innovations are:
- Unified Model for Images and Videos: SAM 2 treats images as single-frame videos, allowing it to handle both types of input seamlessly. This unification is achieved by leveraging memory to recall previously processed information in videos, enabling accurate segmentation across frames.
- Promptable Visual Segmentation Task: SAM 2 generalizes the image segmentation task to the video domain by taking input prompts (points, boxes, or masks) in any frame of a video to define a spatio-temporal mask (masklet). It can make immediate predictions and propagate them temporally, refining the segmentation iteratively with additional prompts.
- Advanced Dataset (SA-V): SAM 2 is trained on the SA-V dataset, which is significantly larger than existing video segmentation datasets. This extensive dataset enables SAM 2 to achieve state-of-the-art performance in video segmentation.
3. How can I run SAM 2?
You can either check the SAM 2 repository or set up the model on your own machine using this Jupyter Notebook. In this section, we describe the latter approach.
3.1 Prerequisites
- A machine with a GPU
- A library to extract frames from a video (e.g., ffmpeg)
3.2 Setup
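The exact setup cells live in the companion notebook; as a minimal sketch, assuming you are installing from Meta's public segment-anything-2 GitHub repository into a Python 3.10+ environment with a recent PyTorch build, the steps look roughly like this:

```bash
# Clone the SAM 2 repository and install it in editable mode
git clone https://github.com/facebookresearch/segment-anything-2.git
cd segment-anything-2
pip install -e .

# Extra packages used later in this walkthrough for visualization and frame handling
pip install matplotlib opencv-python
```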
3.3 Download SAM 2 checkpoints
We'll only download the largest model, but smaller variants are available too.
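A sketch of grabbing just the large checkpoint; the URL follows the pattern used by the repository's checkpoints/download_ckpts.sh script at the time of the initial release, so check that script if the link has moved:

```bash
# Download the largest SAM 2 checkpoint (sam2_hiera_large); the tiny, small,
# and base_plus variants are listed in checkpoints/download_ckpts.sh
mkdir -p checkpoints
wget -P checkpoints \
  https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
```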
3.4 Create a predictor
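With the checkpoint in place, we can build the video predictor. The snippet below is a minimal sketch; the config name and checkpoint path assume the initial SAM 2 release layout and may differ in newer versions of the repository:

```python
from sam2.build_sam import build_sam2_video_predictor

# Hydra config shipped with the repo and the checkpoint downloaded above
model_cfg = "sam2_hiera_l.yaml"
sam2_checkpoint = "./checkpoints/sam2_hiera_large.pt"

# Builds a SAM2VideoPredictor and moves it to the GPU (a GPU is listed in the prerequisites)
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device="cuda")
```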
3.5 Extract the frames from your video and explore the data
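SAM 2's video predictor consumes a directory of JPEG frames rather than the video file itself. Below is a sketch using ffmpeg with a zero-padded naming scheme (00000.jpg, 00001.jpg, …); the input path my_clip.mp4 is a placeholder for your own clip:

```bash
# Split the video into individual JPEG frames, one file per frame
mkdir -p video_frames
ffmpeg -i my_clip.mp4 -q:v 2 -start_number 0 video_frames/'%05d.jpg'
```

You can then open a few of the extracted frames with any image viewer (or matplotlib) to explore the clip and decide which frame and pixel locations to prompt.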
3.6 Define the objects to segment using coordinates
We define a helper function for working with the list of (x, y) point coordinates we'll use as prompts:
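The original helper isn't reproduced here; a hypothetical equivalent, which overlays a list of (x, y) prompt points on a frame so you can verify their placement before running the model, might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def show_points(coords, labels, ax, marker_size=200):
    """Draw positive prompts (label 1) in green and negative prompts (label 0) in red."""
    coords, labels = np.asarray(coords), np.asarray(labels)
    pos, neg = coords[labels == 1], coords[labels == 0]
    ax.scatter(pos[:, 0], pos[:, 1], color="lime", marker="*", s=marker_size, edgecolor="white")
    ax.scatter(neg[:, 0], neg[:, 1], color="red", marker="*", s=marker_size, edgecolor="white")

# Quick visual check on the first frame (the coordinates are placeholders)
frame = Image.open("video_frames/00000.jpg")
fig, ax = plt.subplots(figsize=(9, 6))
ax.imshow(frame)
show_points([[460, 260]], [1], ax)
plt.show()
```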
We initialize the inference state and provide the coordinates of the objects we aim to segment:
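Here is a sketch of this step using the video predictor API from the SAM 2 repository; the frame index, object id, and point coordinates are placeholders for the person we prompted in our clip (in newer releases the same call is exposed as add_new_points_or_box):

```python
import numpy as np

# Build the predictor's per-video memory over the extracted frames
inference_state = predictor.init_state(video_path="video_frames")

# Three positive clicks (label 1) on the person in frame 0; obj_id tags the resulting masklet
points = np.array([[460, 260], [470, 320], [455, 380]], dtype=np.float32)
labels = np.array([1, 1, 1], dtype=np.int32)

_, out_obj_ids, out_mask_logits = predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=points,
    labels=labels,
)
```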
As shown in Figure 5, three points were enough for the model to assign a mask to the whole body of the individual.
Now we run the process on all the frames (Figure 6):
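A sketch of the propagation step, again based on the video predictor API; the overlay color and output paths are arbitrary choices for this walkthrough:

```python
import os
import cv2
import numpy as np

# Propagate the frame-0 prompts through the whole clip
video_segments = {}  # frame_idx -> {obj_id: boolean mask}
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        obj_id: (out_mask_logits[i] > 0.0).squeeze().cpu().numpy()
        for i, obj_id in enumerate(out_obj_ids)
    }

# Blend each mask onto its frame and save the result for the ffmpeg step below
frame_dir, out_dir = "video_frames", "segmented_frames"
os.makedirs(out_dir, exist_ok=True)
for idx, name in enumerate(sorted(os.listdir(frame_dir))):
    frame = cv2.imread(os.path.join(frame_dir, name))
    for mask in video_segments.get(idx, {}).values():
        frame[mask] = (0.5 * frame[mask] + 0.5 * np.array([255, 144, 30])).astype(np.uint8)  # BGR tint
    cv2.imwrite(os.path.join(out_dir, name), frame)
```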
Finally, we combine the frames to generate a video using ffmpeg. The end result is shown in Figure 7.
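Assuming the overlaid frames were written to segmented_frames/ as in the previous sketch, one way to stitch them back together is:

```bash
# Re-encode the overlaid frames into an mp4 (match -framerate to the source video's fps)
ffmpeg -framerate 30 -i segmented_frames/%05d.jpg -c:v libx264 -pix_fmt yuv420p segmented_video.mp4
```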
Figure 7. Top: original video; bottom: video after running SAM 2 on it
4. What’s next
SAM 2’s ability to segment objects accurately and quickly in both images and videos can revolutionize how computer vision systems are created.
In Part 2, we'll explore how we can use GPT-4o to provide visual prompts to SAM 2 in what we call a cascade of foundation models: chaining models together to create the vision systems of the future.
🔥 Learn more about the cutting edge of multimodality and foundation models in our CVPR 2024 series:
- Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more).
- Top Highlights You Must Know — Embodied AI, GenAI, Foundation Models, and Video Understanding.
References
[1] Segment Anything Model 2
[2] Segment Anything Model
Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan
👉 If you would like to know more about Tenyks, try our sandbox.