SAMURAI

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

domain: tracking

fig1

SAM2 has two typical failure cases (Fig. 1).
- confusion in crowded scene
- ineffective memory utilization during occlusions
To address these, the following two approaches are used.
- Use the object trajectory (motion cues).
- Improve the ability to differentiate similar objects.

Method

fig2

A Kalman filter predicts the next position, then the IoU between that prediction and each SAM output mask is computed.
The mask is selected by a weighted average of this IoU and the SAM output affinity score.
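
The selection step can be sketched as follows. This is a minimal illustration, not the SAMURAI implementation: it assumes each candidate mask has already been reduced to a bounding box in (x1, y1, x2, y2) corner form, and the function names `box_iou` and `select_mask` are hypothetical. The weight values in the example are arbitrary.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_mask(pred_box, cand_boxes, affinity_scores, alpha_kf):
    """Return the index of the candidate with the highest combined score.

    Combined score = alpha_kf * IoU(prediction, candidate)
                     + (1 - alpha_kf) * affinity score of the candidate.
    """
    combined = [
        alpha_kf * box_iou(pred_box, box) + (1.0 - alpha_kf) * s_mask
        for box, s_mask in zip(cand_boxes, affinity_scores)
    ]
    return max(range(len(combined)), key=combined.__getitem__)


# Illustrative values only (assumed, not taken from the paper).
pred_box = (10, 10, 50, 50)                         # Kalman filter prediction
cand_boxes = [(12, 12, 52, 52), (100, 100, 140, 140)]
affinity_scores = [0.80, 0.85]                      # second mask has higher affinity

best_with_motion = select_mask(pred_box, cand_boxes, affinity_scores, alpha_kf=0.5)
best_without_motion = select_mask(pred_box, cand_boxes, affinity_scores, alpha_kf=0.0)
```

With the motion term switched off (`alpha_kf=0.0`), the far-away candidate wins on affinity alone. With the motion term on, the candidate that overlaps the predicted box is chosen, which is the intended effect in crowded scenes.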
Kalman filter modeling
- state vector (a dot denotes the rate of change) $$\mathbf{x} = [x, y, w, h, \dot{x}, \dot{y}, \dot{w}, \dot{h}]^T$$
prediction $$\hat{\mathbf{x}}_{t+1|t} = F \hat{\mathbf{x}}_{t|t}$$
KF-IoU score $$s_{\text{kf}}(M_i) = \text{IoU}(\hat{\mathbf{x}}_{t+1|t}, M_i)$$
mask selection $$M^* = \arg\max_{M_i} \left( \alpha_{\text{kf}} \cdot s_{\text{kf}}(M_i) + (1 - \alpha_{\text{kf}}) \cdot s_{\text{mask}}(M_i) \right)$$
update $$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t (z_t - H \hat{\mathbf{x}}_{t|t-1})$$
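
A minimal NumPy sketch of the predict and update steps under a constant-velocity model. The concrete forms of `F` and `H` (unit time step, only the box being observed), the noise covariances `Q` and `R`, and the function names are assumptions for illustration; the notes above do not specify them.

```python
import numpy as np

DIM_BOX = 4  # measured quantities: x, y, w, h

# State transition F: each box coordinate advances by its rate of change
# over one frame (assumed time step of 1).
F = np.eye(2 * DIM_BOX)
F[:DIM_BOX, DIM_BOX:] = np.eye(DIM_BOX)

# Observation matrix H: only the box part (x, y, w, h) of the state is measured.
H = np.hstack([np.eye(DIM_BOX), np.zeros((DIM_BOX, DIM_BOX))])

# Hypothetical noise levels, chosen only for illustration.
Q = 1e-2 * np.eye(2 * DIM_BOX)  # process noise covariance
R = 1e-1 * np.eye(DIM_BOX)      # measurement noise covariance


def predict(x, P):
    """Prediction step: x_{t+1|t} = F x_{t|t}, with covariance propagation."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


def update(x_pred, P_pred, z):
    """Update step: x_{t|t} = x_{t|t-1} + K_t (z_t - H x_{t|t-1})."""
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain K_t
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new


# Example: a box moving right at 2 px/frame (illustrative values).
x0 = np.array([100.0, 50.0, 40.0, 80.0, 2.0, 0.0, 0.0, 0.0])
P0 = np.eye(2 * DIM_BOX)

x_pred, P_pred = predict(x0, P0)
z = np.array([104.0, 50.0, 40.0, 80.0])  # box taken from the selected mask
x_new, P_new = update(x_pred, P_pred, z)
```

Here `z` stands for the measurement, that is, the bounding box derived from the selected mask. The updated estimate lands between the prediction and the measurement, weighted by the Kalman gain.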