Segment Anything Meets Point Tracking (arXiv:2307)
TL;DR
- This paper introduces SAM-PT, a point-centric interactive video segmentation model, empowered by SAM and long-term point tracking.
- However, its performance still trails existing SOTA methods, and the pipeline is somewhat complicated and heuristic.
Motivation
The authors presumably wanted to bring an image segmentation foundation model, namely SAM, into the video segmentation domain. To do this, they use a point tracking method such as CoTracker.
Method
SAM-PT
SAM-PT works in four steps:
- Select query points on the first frame.
- Propagate the points to all frames of the video using the tracker module.
- Generate a segmentation mask for each frame independently using SAM.
- Optionally, reinitialize the process by sampling new query points from the predicted masks.
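The four steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `sam_pt_pipeline`, `track`, `segment`, and `resample` are hypothetical names, the tracker and SAM are passed in as plain callables, and the horizon is assumed to count the query frame itself.

```python
from typing import Callable, List, Optional

import numpy as np


def sam_pt_pipeline(
    frames: List[np.ndarray],
    query_points: np.ndarray,
    track: Callable[[List[np.ndarray], np.ndarray], np.ndarray],
    segment: Callable[[np.ndarray, np.ndarray], np.ndarray],
    resample: Optional[Callable[[np.ndarray], np.ndarray]] = None,
    horizon: int = 8,
) -> List[np.ndarray]:
    """Hypothetical skeleton of the four steps (not the authors' code).

    track(chunk, points) -> trajectories of shape (T, N, 2), where points
        are (x, y) coordinates defined on the first frame of the chunk.
    segment(frame, points) -> boolean mask for that frame.
    resample(mask) -> new (N, 2) query points; None disables step 4.
    """
    if resample is not None and horizon < 2:
        raise ValueError("horizon must be >= 2 when reinitialization is on")

    masks: List[np.ndarray] = []
    points = query_points      # step 1: query points on the first frame
    query_idx = 0              # frame on which `points` are defined
    n = len(frames)

    while len(masks) < n:
        # Without reinitialization, track through the whole video at once.
        stop = n if resample is None else min(query_idx + horizon, n)
        chunk = frames[query_idx:stop]
        trajectories = track(chunk, points)           # step 2: propagate

        for t, frame in enumerate(chunk):
            if query_idx + t < len(masks):
                continue                              # already segmented
            masks.append(segment(frame, trajectories[t]))  # step 3

        if resample is not None:
            new_points = resample(masks[-1])          # step 4: reinitialize
            # Fall back to the last tracked points if the mask is empty.
            points = new_points if len(new_points) > 0 else trajectories[-1]
            query_idx = stop - 1   # the next chunk starts at the last frame
    return masks
```

Passing the tracker and the segmenter as callables keeps the sketch independent of any particular tracker or SAM wrapper, which is the point of the design: either component can be swapped without touching the loop.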
Query Points Selection
The first step is to generate multiple query points. Since the paper covers both interactive VOS and semi-supervised VOS, there are two ways to obtain the initial points.
For semi-supervised VOS, multiple points are sampled from the first-frame ground-truth mask using various methods (see Fig. 3). For interactive point-based VOS, the user-provided points are used directly.
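For the semi-supervised case, one simple way to turn a first-frame mask into query points is uniform random sampling. The sketch below uses it only as a stand-in: it is not claimed to reproduce any specific strategy from Fig. 3, and `sample_query_points`, `num_pos`, `num_neg`, and `seed` are hypothetical names chosen for illustration.

```python
import numpy as np


def sample_query_points(mask, num_pos=8, num_neg=1, seed=0):
    """Sample (x, y) points uniformly at random from a binary mask.

    Positive points come from inside the mask, negative points from
    outside it. Uniform sampling is an assumed stand-in strategy.
    """
    rng = np.random.default_rng(seed)
    mask = np.asarray(mask, dtype=bool)

    def pick(region, k):
        ys, xs = np.nonzero(region)
        if len(ys) == 0 or k == 0:
            return np.empty((0, 2), dtype=np.float32)
        # Sample without replacement, capped at the number of pixels.
        idx = rng.choice(len(ys), size=min(k, len(ys)), replace=False)
        return np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)

    return pick(mask, num_pos), pick(~mask, num_neg)


# Usage: a toy 10x10 mask with a 4x4 object.
toy_mask = np.zeros((10, 10), dtype=bool)
toy_mask[3:7, 3:7] = True
pos_points, neg_points = sample_query_points(toy_mask, num_pos=5, num_neg=2)
```

Points are returned in (x, y) order because that is the convention point prompts usually follow, while NumPy indexes masks as `[y, x]`.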
Point Tracking
The points are then propagated across all frames of the video using an off-the-shelf point tracker such as PIPS or CoTracker.
Segmentation
To prompt SAM, the tracked positive and negative points are combined and fed to it. The model actually works in two passes: the first pass is the ordinary SAM prompt, and the second pass takes its input from the result of the first pass. In the second pass, negative points provide a more nuanced distinction between the object and the background. The second pass is repeated a variable number of times to refine the mask.
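The two-pass prompting can be sketched as follows. This is an illustration under assumptions rather than the paper's code: it assumes the first pass uses only positive points, that the second pass adds the negative points and feeds back the previous mask, and that `predict` is a hypothetical callable wrapping SAM. The label convention (1 for foreground, 0 for background) is the one SAM uses for point prompts.

```python
import numpy as np


def build_prompt(pos, neg):
    """Stack positive and negative (x, y) points into one point prompt."""
    pos = np.asarray(pos, dtype=np.float32).reshape(-1, 2)
    neg = np.asarray(neg, dtype=np.float32).reshape(-1, 2)
    coords = np.concatenate([pos, neg], axis=0)
    labels = np.concatenate([
        np.ones(len(pos), dtype=np.int32),   # 1 = foreground point
        np.zeros(len(neg), dtype=np.int32),  # 0 = background point
    ])
    return coords, labels


def two_pass_segment(predict, pos, neg, refine_iters=1):
    """Run the assumed two-pass scheme with a hypothetical SAM wrapper.

    predict(coords, labels, prev_mask) -> mask, where prev_mask is None
    on the first pass.
    """
    # Pass 1 (assumption): positive points only, no previous mask.
    coords, labels = build_prompt(pos, np.empty((0, 2)))
    mask = predict(coords, labels, None)

    # Pass 2: positive and negative points plus the previous mask,
    # repeated refine_iters times for refinement.
    coords, labels = build_prompt(pos, neg)
    for _ in range(refine_iters):
        mask = predict(coords, labels, mask)
    return mask
```

With `refine_iters=0` the sketch reduces to a single positive-only prompt, which makes it easy to compare the effect of the refinement passes.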
Point Tracking Reinitialization
The reinitialization step is optional. Once predictions for a horizon of $h=8$ frames have been made, all previous points are discarded and new points are sampled from the last predicted segmentation mask.
SAM-PT vs. Object-centric Mask Propagation
The comparison between SAM-PT and previous methods is reported in Tab. 1.
iDeA: Why should we use this method when its performance is inferior to that of previous methods?
Experiments
This section is intentionally left blank.
Discussion
- The method is interesting and sounds promising. However, the paper does not yet make a persuasive case for why this multi-step, complicated, and somewhat heuristic method is needed.