Imitation Learning from Video Demonstrations

Approaches to imitation learning from video can be divided along two axes:

- Using structure information in the video (explicit) vs. using the video's latent representation (implicit)
- Using internet-scale action video vs. using demonstration video

Categorization by Information Type

Using structure information (explicit)
- These methods are typically designed to leverage prior knowledge from vision expert models, such as tracking, detection, contact points, and affordances.
- The extracted information is mostly used as a supervised learning signal for predicting the next state or setting a goal.
- They tend to use more abstract information such as contact points or tracks rather than pose estimates.
  - This is presumably due to the morphological difference between embodiments, whether human-to-agent or agent-to-agent.
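
To make the "abstract" signals above concrete, here is a minimal sketch of how a contact point and the trajectory after contact could be read off tracked positions. It assumes the hand and object tracks already exist (for example from an off-the-shelf tracker, which is not shown); the function name `contact_and_trajectory` and the pixel threshold `contact_radius` are hypothetical and not taken from any of the papers listed later.

```python
import numpy as np


def contact_and_trajectory(hand_xy, object_xy, contact_radius=10.0):
    """Return (contact_index, contact_point, post_contact_trajectory).

    hand_xy, object_xy: arrays of shape (T, 2) with per-frame pixel positions.
    contact_radius: hypothetical pixel threshold, chosen only for illustration.
    """
    hand_xy = np.asarray(hand_xy, dtype=float)
    object_xy = np.asarray(object_xy, dtype=float)
    # Per-frame distance between the tracked hand and the tracked object.
    dist = np.linalg.norm(hand_xy - object_xy, axis=1)
    close = np.flatnonzero(dist <= contact_radius)
    if close.size == 0:
        return None, None, None  # the hand never reaches the object
    t = int(close[0])  # first frame treated as the contact frame
    # Hand motion from the contact frame onward.
    return t, hand_xy[t], hand_xy[t:]


# Toy track: the hand approaches a static object, then drags it to the right.
hand = np.array([[0, 0], [20, 0], [45, 0], [48, 0], [60, 0], [80, 0]], dtype=float)
obj = np.array([[50, 0]] * 4 + [[62, 0], [82, 0]], dtype=float)
t, point, traj = contact_and_trajectory(hand, obj)
```

Because the output is only a point and a path in image space, it does not depend on the number of fingers or joints of whoever performed the motion, which is the property that makes such signals attractive across embodiments.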
Using representation information (implicit)
- Some methods use various features of the video as the state and learn from them without supervision.
- Diffusion models have been widely used recently, for example to:
  - predict the goal state from the initial state
  - predict the goal state at the pixel level
  - translate a human demonstration video into a robot demonstration video
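
As an illustration of unsupervised learning on video features, the sketch below computes a time-contrastive (InfoNCE-style) loss on per-frame embeddings: frames that are close in time should be closer in embedding space than frames that are far apart. This is a generic formulation, not the exact loss of any paper in this post, and the gaps `pos_gap` and `neg_gap` are hypothetical values.

```python
import numpy as np


def time_contrastive_loss(z, pos_gap=1, neg_gap=5):
    """InfoNCE-style loss over frame embeddings z of shape (T, D).

    For each anchor frame i, frame i + pos_gap is the positive and
    frame i + neg_gap is the negative. Similarity is negative L2 distance.
    """
    z = np.asarray(z, dtype=float)
    n = z.shape[0] - neg_gap
    if n <= 0:
        raise ValueError("video is too short for the chosen neg_gap")
    anchor = z[:n]
    sim_pos = -np.linalg.norm(anchor - z[pos_gap:pos_gap + n], axis=1)
    sim_neg = -np.linalg.norm(anchor - z[neg_gap:neg_gap + n], axis=1)
    # -log softmax of the positive against the set {positive, negative}.
    loss = np.logaddexp(sim_pos, sim_neg) - sim_pos
    return float(loss.mean())


# Embeddings that drift smoothly over time vs. the same frames in scrambled order.
smooth = np.linspace(0.0, 10.0, 20)[:, None] * np.ones((1, 4))
scrambled = smooth[(np.arange(20) * 7) % 20]
loss_smooth = time_contrastive_loss(smooth)
loss_scrambled = time_contrastive_loss(scrambled)
```

The smoothly drifting embeddings give a lower loss than the scrambled ones, so minimizing this quantity pushes an encoder toward embeddings that respect temporal order.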

Categorization by Video Type

Using large-scale human action video
- These methods mainly use human egocentric action video datasets that come without action labels (Ego4D, EPIC-Kitchens).
- Use of third-person view video appears to be very rare.
- Because these datasets only provide text annotations describing the action, they are of limited use for supervised learning of actions.
- They are therefore used either to build pseudo-labels with vision expert models or for unsupervised learning that acquires world knowledge.
- Finetuning on a robot dataset still appears to be necessary for downstream tasks.
Using demonstration video
- Here, demonstration video appears to mean either footage of a human or another agent demonstrating the task the robot is supposed to perform, or a dataset made of video-action pairs.
- The methodology is not very different in this case.
- However, because the datasets are small, image-level features appear to be used more often than latent representations.

Paper List

Explicit feature extraction from human motion vs. implicit feature extraction
Using a demonstration dataset vs. using plain action video
- Demonstration video here means video that is provided with corresponding action labels.

The papers below are described along these two criteria. Datasets in which video-action pairs were constructed manually for the videos are introduced as well.

Tags
- ego: uses human egocentric video (Ego4D, EPIC-Kitchens)
- demo: uses demonstration video (video with action labels)
- pose, track, contact point, diffusion: methodology
Explicit Feature Extraction: The main obstacle for this approach is the morphological difference between robots and humans. To get around it, many methods extract trajectories, contact points, or affordances from the demonstration video rather than extracting pose directly. The extraction step relies on the prior knowledge of vision expert models.
- DexMV (ECCV 2022), tags: ego, pose
  - Uses the hands in human egocentric video.
  - Performs 3D hand pose estimation, then uses the object together with the orientations of the fingertips and palm.
- Vision-Robotics Bridge (CVPR 2023), tags: ego, contact point, track
  - Uses contact points from human egocentric video.
  - Uses the point the human contacts in order to solve the task and the trajectory that follows.
- SWIM (RSS 2023), tags: ego + demo, contact point, track
  - Uses human egocentric video for pretraining and robot trajectory video for finetuning.
  - Trains a world model without supervision to predict the next state from the contact points and trajectories in egocentric video.
- Track2Act (ECCV 2024), tags: demo, track, diffusion
  - Uses human action video and a robot dataset.
  - Trains a diffusion model to generate point trajectories given an initial image and a goal image.
- VIEW (Autonomous Robots 2025), tags: demo, track
  - Uses third-person human demonstration video.
  - Extracts the hand and object trajectories, then finds keypoints and uses MSE as the loss.
- ARM4R (arXiv 2502), tags: ego + demo, track
  - Pretrains on human egocentric video and finetunes on demonstration video.
  - Lifts 2D points in the video to 3D, then tracks them and uses them as the state.
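
Two operations that recur in the list above, lifting 2D points to 3D and comparing trajectories with MSE, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of ARM4R or VIEW: the lifting assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a known depth per point, and the waypoints are simply sampled at even intervals, which is a stand-in for whatever keypoint selection a given paper uses. Both function names are hypothetical.

```python
import numpy as np


def lift_to_3d(uv, depth, fx, fy, cx, cy):
    """Back-project pixel points (N, 2) with per-point depth (N,) to
    camera-frame 3D points (N, 3) using the pinhole camera model."""
    uv = np.asarray(uv, dtype=float)
    depth = np.asarray(depth, dtype=float)
    x = (uv[:, 0] - cx) * depth / fx
    y = (uv[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)


def waypoint_mse(traj_a, traj_b, num_waypoints=5):
    """MSE between evenly spaced waypoints of two trajectories.

    Even spacing is an assumption made for this sketch; it also lets the
    two trajectories have different lengths.
    """
    def sample(traj):
        traj = np.asarray(traj, dtype=float)
        idx = np.linspace(0, len(traj) - 1, num_waypoints).round().astype(int)
        return traj[idx]

    diff = sample(traj_a) - sample(traj_b)
    return float(np.mean(diff ** 2))


# Lift three tracked pixels with known depth (intrinsics are made-up values).
points = lift_to_3d(
    uv=[[320, 240], [420, 240], [320, 340]],
    depth=[1.0, 2.0, 0.5],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)

# Compare a human hand path with a robot path that is offset by 0.1 on every axis.
human_traj = np.linspace([0.0, 0.0, 1.0], [1.0, 0.0, 1.0], 11)
robot_traj = human_traj + 0.1
err = waypoint_mse(human_traj, robot_traj)
```

A loss of this form only constrains where the end effector or object should pass, not how the joints get there, which is why it sidesteps the human-robot morphology gap described at the top of this list.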
Representation Extraction: Rather than obtaining structure information from the video, these methods generate actions from a representation of the video.
- GAIL (NeurIPS 2016), tags: demo, 4040 citations
  - Borrows the structure of a GAN for the reward model used when learning from expert demonstrations.
  - The policy is trained with RL, but the reward is obtained from a discriminator that distinguishes expert from agent.
- XIRL (CoRL 2021 Oral), tags: action-free demo, representation
  - Uses demonstration videos of other agents (humans) without action labels.
  - Learns a task-progress embedding from demonstration videos performed by different embodiments.
  - Learns the embedding space so that frames at the same level of progress in different videos are aligned (Temporal Cycle Consistency).
  - Computes the reward from the latent-space distance to the last frames of the demonstration videos.
- R3M (CoRL 2022), tags: ego, representation
  - Uses human egocentric images and captions.
  - Trains an image encoder with unsupervised objectives based on temporal distance and vision-language (VL) alignment. The R3M encoder thereby acquires spatial and temporal understanding.
  - The image encoder is then frozen and used for downstream tasks.
- DIFO (NeurIPS 2024), tags: demo, diffusion
  - Replaces GAIL's discriminator with a diffusion model.
- Human2Robot (arXiv 2502), tags: demo, diffusion
  - Uses a diffusion model trained on the H&R dataset (described below).
  - Given a video of human behavior, generates the corresponding robot video and actions with diffusion.
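
The two reward signals mentioned in this list, a discriminator-based reward in the style of GAIL and a distance-to-goal reward in embedding space in the style of XIRL, can be sketched in a few lines. This is a hedged illustration rather than either paper's code: the discriminator output is assumed to be the probability that a sample comes from the expert, `r = -log(1 - D)` is one common convention among several, and the function names and the `scale` parameter are hypothetical.

```python
import numpy as np


def discriminator_reward(expert_prob, eps=1e-8):
    """Reward from a discriminator output D in (0, 1), read here as
    P(expert | state, action). Uses r = -log(1 - D), an assumed convention."""
    d = np.clip(np.asarray(expert_prob, dtype=float), eps, 1.0 - eps)
    return -np.log(1.0 - d)


def goal_distance_reward(state_emb, demo_last_frame_embs, scale=1.0):
    """Negative latent distance between the current state embedding and the
    mean embedding of the last frames of the demonstration videos."""
    goal = np.mean(np.asarray(demo_last_frame_embs, dtype=float), axis=0)
    dist = np.linalg.norm(np.asarray(state_emb, dtype=float) - goal)
    return -float(dist) / scale


# The more expert-like the discriminator finds a sample, the higher the reward.
r = discriminator_reward([0.1, 0.5, 0.9])

# Goal embedding is the mean of the demos' last-frame embeddings, here [1, 1].
last_frames = [[1.0, 0.0], [1.0, 2.0]]
r_at_goal = goal_distance_reward([1.0, 1.0], last_frames)
r_far = goal_distance_reward([4.0, 5.0], last_frames)
```

In both cases the policy itself is still trained with ordinary RL; only the source of the reward changes, from a learned discriminator in one case to a learned embedding space in the other.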
Dataset
- Human2Robot (arXiv 2502)
  - Proposes the H&R dataset.
  - 2,600 episodes
  - A paired video dataset in which a human and a robot perform the same behavior.
  - The pairs are aligned frame by frame.
