Method

fig2

The model first produces a set of non-overlapping binary mask sequences $\hat y = \{\mathcal{M}_i \in \mathbb R ^{T\times H \times W} \}^k_{i=1}$.

$$\hat y = \phi_θ(x_v,q)$$

Each output mask sequence $\mathcal{M}_i = \{m_{i,t} \in \mathbb R^{H\times W}\}^T_{t=1}$ corresponds to one object instance.
- This assumes $m_{i,t} \cap m_{j,t} = \emptyset \quad \text{for } i\neq j$.¹
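
The non-overlap assumption is easy to check numerically. A minimal sketch, assuming the $k$ mask sequences are stacked into one boolean NumPy array of shape `(k, T, H, W)`; the function name `masks_are_disjoint` is made up for these notes and does not come from the paper.

```python
import numpy as np


def masks_are_disjoint(masks: np.ndarray) -> bool:
    """Return True if no pixel of any frame belongs to more than one instance.

    masks: array of shape (k, T, H, W); nonzero entries count as foreground.
    (The stacked layout is an assumption made for this sketch.)
    """
    # Count how many instances claim each pixel of each frame.
    coverage = masks.astype(bool).sum(axis=0)  # shape (T, H, W)
    # Disjoint means every pixel is claimed by at most one instance.
    return bool((coverage <= 1).all())
```

Counting per-pixel coverage avoids comparing all $k(k-1)/2$ instance pairs explicitly.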

Multi-Agentic Framework

fig3

A closed-source MLLM selects the keyframes.
- GPT-4o and Gemma3 are used.
- It produces keyframe-related object descriptions, the respective keyframes, and object indices.

Through a CoT process, it generates questions and locates the objects.

Given a video $x_v$ and a user query $q$, $T'$ keyframe candidates are sampled uniformly.
From the keyframe candidates, refined instance-level text descriptions $s$ and the corresponding keyframes $f$ are obtained.

$$I^c_{1:T'} = \text{Sample}(I_{1:T}, ξ)$$

$$s_{1:k},f_{1:k} = \mathcal{F}_{key}(q, I^c_{1:T'})$$
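
The uniform sampling step can be sketched as follows. The notes do not say how the sampling parameter $ξ$ determines $T'$, so this sketch takes the number of candidates directly; `uniform_sample_indices` is a hypothetical helper, and picking the centre of each equal-width bin is one common choice, not necessarily the paper's.

```python
from typing import List


def uniform_sample_indices(num_frames: int, num_candidates: int) -> List[int]:
    """Pick num_candidates frame indices spread evenly over [0, num_frames).

    Assumption for this sketch: each index is the centre of one of
    num_candidates equal-width bins over the video.
    """
    if num_frames <= 0 or num_candidates <= 0:
        raise ValueError("num_frames and num_candidates must be positive")
    # Never request more candidates than there are frames.
    n = min(num_candidates, num_frames)
    # Integer form of floor((i + 0.5) * num_frames / n), avoiding float rounding.
    return [((2 * i + 1) * num_frames) // (2 * n) for i in range(n)]
```

For example, `uniform_sample_indices(10, 5)` yields the indices `[1, 3, 5, 7, 9]`.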

A reasoning segmentation model $\mathcal{F}_{seg}$ then generates the per-instance key masks $\{\tilde m _i \in \mathbb R^{H\times W}\}$.
- These are tracked through the video with SAM2.

$$\tilde m_i = \mathcal{F}_{seg} (s_i, f_i); \quad \hat M_i = \mathcal{F}_{vid}(\tilde m_i, x_v) \quad \text{for} \quad i =1,2,\dots,k$$
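
How the three agents chain together can be sketched as plain function composition. Everything here is hypothetical scaffolding for these notes: `select_keyframes`, `segment_keyframe`, and `track_mask` stand in for $\mathcal{F}_{key}$, $\mathcal{F}_{seg}$, and $\mathcal{F}_{vid}$ and are passed in as callables, so no real MLLM or SAM2 API is assumed. Passing the keyframe index to the tracker is also an assumption; the equation above only lists $\tilde m_i$ and $x_v$.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_pipeline(
    query: str,
    frames: Sequence[Any],
    candidate_indices: Sequence[int],
    select_keyframes: Callable[[str, List[Any]], List[Tuple[str, int]]],
    segment_keyframe: Callable[[str, Any], Any],
    track_mask: Callable[[Any, int, Sequence[Any]], Any],
) -> List[Any]:
    """Compose the keyframe, segmentation, and tracking stages (sketch only)."""
    candidates = [frames[i] for i in candidate_indices]
    # Stand-in for F_key: one (description s_i, position among candidates) per instance.
    selections = select_keyframes(query, candidates)
    tracks = []
    for description, position in selections:
        frame_index = candidate_indices[position]
        # Stand-in for F_seg: key mask for this instance on its keyframe.
        key_mask = segment_keyframe(description, frames[frame_index])
        # Stand-in for F_vid: propagate the key mask through the whole video.
        tracks.append(track_mask(key_mask, frame_index, frames))
    return tracks
```

Keeping the stages as injected callables makes it possible to swap the MLLM or the tracker without touching the control flow.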

$$m_{1,t}=\hat m_{1,t}$$

$$m_{i,t}=\bigcap^{i-1}_{j=1}\lnot m_{j,t}\cap \hat m_{i,t} \quad \text{for} \quad i = 2,\dots,k$$
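
A minimal sketch of this overlap removal, reading the rule as "instance $i$ keeps only the pixels not already claimed by instances $1,\dots,i-1$". It assumes the tracked masks are stacked in priority order into a boolean array of shape `(k, T, H, W)`; `resolve_overlaps` is a name made up for these notes.

```python
import numpy as np


def resolve_overlaps(tracked: np.ndarray) -> np.ndarray:
    """Make tracked masks mutually disjoint, giving priority to lower indices.

    tracked: array of shape (k, T, H, W); nonzero entries count as foreground.
    Returns a boolean array of the same shape.
    """
    tracked = tracked.astype(bool)
    resolved = np.zeros_like(tracked)
    # Pixels already assigned to an earlier instance, per frame.
    occupied = np.zeros(tracked.shape[1:], dtype=bool)
    for i in range(tracked.shape[0]):
        # Keep only the part of instance i that no earlier instance claimed.
        resolved[i] = tracked[i] & ~occupied
        occupied |= resolved[i]
    return resolved
```

The running `occupied` map plays the role of the union of earlier masks, so each instance costs one pass instead of a comparison against every earlier instance.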

$$m_t = \psi_θ(I_{1:t}, m_{1:t-1}, q)$$

fig4

fig5

tab1-2

tab3-4

tab5-7

fig6-7

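Questions in recent Reasoning VOS benchmarks are especially hard, so producing a text answer first and then solving the segmentation may be effective.
- Delegating this step to GPT is a good idea.
- But the end-to-end side probably needs some thought as well.

A limitation is that no dataset includes guidance for these reasoning steps.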