InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

published: Jan 2025

fig1

Abstract

fig2

Proposes a length-adaptive token representation approach.
- dynamic frame sampling
- Hierarchical Token Compression (HiCo)
  - spatiotemporal-aware compression
  - adaptive multimodal context consolidation
Adaptive Temporal Sampling
- The name sounds grand, but it simply samples short videos at 15 fps and long videos at 1 fps.
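
A minimal sketch of this sampling rule, assuming a duration cutoff decides what counts as "short". The function name `sample_frame_indices` and the 60-second `short_threshold_s` are hypothetical; the notes only give the two rates (15 fps and 1 fps), not where the boundary lies.

```python
def sample_frame_indices(num_frames, native_fps, short_threshold_s=60.0,
                         short_fps=15.0, long_fps=1.0):
    """Return the indices of the frames to keep for one video.

    short_threshold_s is an assumed cutoff, not a value from the paper.
    """
    if num_frames <= 0 or native_fps <= 0:
        return []
    duration_s = num_frames / native_fps
    # Dense sampling for short clips, sparse sampling for long ones.
    target_fps = short_fps if duration_s <= short_threshold_s else long_fps
    # Never sample faster than the video's own frame rate.
    target_fps = min(target_fps, native_fps)
    step = native_fps / target_fps  # source frames per sampled frame
    num_samples = max(1, int(duration_s * target_fps))
    return [min(num_frames - 1, int(i * step)) for i in range(num_samples)]
```

With these defaults, a 10-second clip at 30 fps keeps every second frame, while a 10-minute video at 30 fps keeps one frame per second.
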
Hierarchical Token Compression
- Split the video into $T$ temporal segments.
- Encode each segment into $M$ tokens.
- Reduce these to $N$ tokens with adaptive compression ($N < M$).
- Among the pooling methods compared, semantic similarity-based token merging (ToMe) ¹ worked best.
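
The per-segment reduction from $M$ to $N$ tokens can be illustrated with a simplified similarity-based merge. This is not ToMe's actual bipartite matching and not the paper's implementation. It is a greedy stand-in that repeatedly averages the most similar pair of tokens. The names `merge_tokens` and `compress_segments` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def merge_tokens(tokens, n_keep):
    """Greedily merge the most similar pair until n_keep tokens remain."""
    # Each entry is (vector, number of original tokens merged into it).
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > max(1, n_keep):
        best = None
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                s = cosine(merged[i][0], merged[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (va, ca), (vb, cb) = merged[i], merged[j]
        # Size-weighted average so earlier merges are not under-counted.
        avg = [(x * ca + y * cb) / (ca + cb) for x, y in zip(va, vb)]
        merged[i] = (avg, ca + cb)
        del merged[j]
    return [v for v, _ in merged]

def compress_segments(segments, n_keep):
    """Compress each of the T segments (M tokens each) down to N tokens."""
    return [merge_tokens(seg, n_keep) for seg in segments]
```

The greedy search costs far more than ToMe's matching step and is only meant to make the idea concrete: redundant tokens collapse into one representative instead of being discarded.
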
Multimodal Token Dropout
- Performs two-phase token reduction.
  - uniform token pruning in early layers
  - attention-guided token selection in deeper layers
- Each token is assigned a preservation probability, and Bernoulli sampling decides whether to keep or discard it.
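
A rough sketch of the Bernoulli keep/discard step, under the assumption that the early phase uses one shared probability and the deeper phase derives probabilities from attention scores. The notes do not say how the probabilities are set, so `uniform_keep_probs`, `attention_keep_probs`, and the normalization below are hypothetical.

```python
import random

def bernoulli_keep(tokens, keep_probs, rng):
    """Keep token i with probability keep_probs[i] (independent draws)."""
    return [t for t, p in zip(tokens, keep_probs) if rng.random() < p]

def uniform_keep_probs(num_tokens, keep_ratio):
    """Early-layer phase: every token shares the same keep probability."""
    return [keep_ratio] * num_tokens

def attention_keep_probs(attn_scores, keep_ratio):
    """Deeper-layer phase (assumed form): probability grows with attention.

    Scores are rescaled so the mean probability is about keep_ratio,
    then clipped to 1.0.
    """
    total = sum(attn_scores)
    n = len(attn_scores)
    if total <= 0:
        return [keep_ratio] * n
    return [min(1.0, keep_ratio * n * s / total) for s in attn_scores]
```

Usage might look like `bernoulli_keep(tokens, uniform_keep_probs(len(tokens), 0.5), random.Random(0))`. Passing a seeded `random.Random` keeps the drop pattern reproducible.
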

→ See Discussion 1

Randomly dropping compressed video tokens is an interesting idea.
- It can probably be viewed as a form of masked training, similar to VideoMAE.
Performance on RVOS is very poor.