Grounding Video Models to Actions through
Goal Conditioned Exploration

Yunhao Luo, Yilun Du
Georgia Tech, Brown, Harvard

Grounding Video Models to Actions. We learn a goal-conditioned policy that grounds video models to executable actions, without any action demonstrations or environment rewards, by using synthesized video frames as goals to direct exploration and training the policy on the collected interaction data. The resulting policy completes a diverse set of tasks by accurately following the subgoals in the synthesized video.


Abstract

Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model, trained on embodiment-specific data, to map image states to actions. Gathering data to train such a model is often expensive and challenging, and the resulting model is limited to visual settings similar to those in which data is available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that combines trajectory-level action generation with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation, and show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, despite not requiring any action annotations.
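To make the framework concrete, the following is a minimal Python sketch of the video-guided exploration loop described above. The interfaces (video_model.synthesize, env.reset/step, env.task_prompt, policy.act, replay_buffer.add) are hypothetical placeholders used for illustration only and do not correspond to a released API.

def video_guided_exploration(video_model, env, policy, replay_buffer,
                             steps_per_goal=20, n_episodes=100):
    """Collect interaction data by chasing the frames of a synthesized video.

    All interfaces used here (video_model.synthesize, env.reset/step,
    policy.act, replay_buffer.add) are hypothetical placeholders that
    illustrate the idea of using generated frames as visual goals; no
    environment rewards or action labels are ever consulted.
    """
    for _ in range(n_episodes):
        obs = env.reset()                                       # current camera image
        goals = video_model.synthesize(obs, env.task_prompt)    # future frames as subgoals
        trajectory = []
        for goal in goals:                                      # visit each synthesized frame in order
            for _ in range(steps_per_goal):
                # Trajectory-level action generation: the policy proposes a
                # short chunk of actions conditioned on the current image and
                # the visual goal, rather than a single step.
                action_chunk = policy.act(obs, goal)
                for action in action_chunk:
                    next_obs = env.step(action)
                    trajectory.append((obs, action, next_obs))
                    obs = next_obs
        # Store the whole rollout; states the agent actually reached can later
        # be relabeled as goals (hindsight) to supervise the policy.
        replay_buffer.add(trajectory)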

Method Overview. Our approach grounds a large pretrained video model into continuous actions through goal-directed exploration in an environment. Given a synthesized video, a goal-conditioned policy attempts to reach each visual goal in the video, and the resulting environment interactions are saved in a replay buffer to train the goal-conditioned policy.
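Below is a hedged sketch of how the replay buffer could supervise the goal-conditioned policy via hindsight goal relabeling: states actually reached later in a rollout are relabeled as goals, so the executed actions become self-generated action labels. The MLP architecture, feature dimensions, and relabeling horizon are illustrative assumptions rather than the exact design used in the paper.

import random
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Toy policy mapping (current image, goal image) features to an action.

    A flat MLP over pre-extracted visual features is an illustrative stand-in
    for whatever encoder the actual system uses.
    """
    def __init__(self, feat_dim=512, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat, goal_feat):
        return self.net(torch.cat([obs_feat, goal_feat], dim=-1))

def hindsight_update(policy, optimizer, trajectory, horizon=16):
    """One gradient step of goal-conditioned supervision with hindsight goals.

    `trajectory` is a list of (obs_feat, action, next_obs_feat) tensors from a
    single exploration rollout; a state reached up to `horizon` steps later is
    relabeled as the goal for the earlier step.
    """
    losses = []
    for t, (obs, action, _) in enumerate(trajectory):
        k = random.randint(1, min(horizon, len(trajectory) - t))
        goal = trajectory[t + k - 1][2]           # a future state that was actually reached
        pred = policy(obs, goal)
        losses.append(((pred - action) ** 2).mean())
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()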


Policy Rollout over Increasing Video-guided Exploration Episodes

(Interactive figure: synthesized video replayed alongside policy rollouts at different numbers of video-guided exploration episodes.)

Synthesized Video and Corresponding Environment Rollouts of the Policy Trained with Different Numbers of Video-guided Exploration Episodes. While the policy struggles to accomplish the task without video-guided exploration (0 video-guided episodes), it improves rapidly once video-guided exploration is introduced.

Libero Environment

For each task below, the synthesized video is shown alongside environment rollouts of AVDC, BC, GCBC, and our policy (Ours):

  • put the red mug on the left plate
  • put the red mug on the right plate
  • put the white mug on the left plate
  • put the white mug on the right plate
  • put the chocolate pudding to the left of the plate
  • put the chocolate pudding to the right of the plate
  • put the red mug on the plate
  • put the white mug on the plate

Qualitative Comparison in the Libero Environment. We present environment rollouts of our goal-conditioned exploration policy along with various baselines -- AVDC, Behavior Cloning (BC), and Goal-conditioned Behavior Cloning (GCBC). Our policy follows the synthesized video and completes the task. In contrast, BC and GCBC cannot accurately locate the target in the unseen setups, and while AVDC can move the end effector close to the mug, it cannot successfully grasp the concave object and gets stuck in local regions.

Meta-World Environment

For each task below, the synthesized video is shown alongside environment rollouts of AVDC, BC, GCBC, and our policy (Ours):

  • Hammer
  • Assembly
  • Door Open
  • Door Close
  • Handle Press

Qualitative Comparison in the Meta-World Environment. We present environment rollouts of our goal-conditioned exploration policy along with various baselines. Our policy accurately reaches each subgoal given by the synthesized video and thus completes the tasks, while GCBC and the other baselines fail. This is likely because our goal-directed exploration significantly enhances the policy by focusing exploration on the task-relevant state space.

iTHOR Visual Navigation Environment

For each scene below, the synthesized video is shown alongside environment rollouts of AVDC, BC, GCBC, and our policy (Ours):

  • Kitchen, Target: Bread
  • Living Room, Target: Television
  • Bedroom, Target: Blinds
  • Bathroom, Target: Mirror

Qualitative Comparison in the iTHOR Environment. We present environment rollouts of our goal-conditioned exploration policy along with various baselines. Both AVDC and BC tend to predict wrong actions or stop prematurely. Though GCBC performs comparably to our policy, it may require extra action steps (see FloorPlan 301, the bedroom scene). Our policy successfully reaches the target object by correctly following the synthesized video.

Qualitative Results in Calvin Environment

Qualitative Results of Our Goal-conditioned Exploration Policy in the Calvin Environment. We present the synthesized video side by side with the policy rollout in the environment. The policy is trained solely through the proposed exploration framework, without access to action annotations or environment rewards (including the task success signal).

Conclusion

In this paper, we have presented a self-supervised approach to ground generated videos into actions. As generative video models become increasingly powerful, we believe they will become increasingly useful for decision-making, providing strong priors on how various tasks should be accomplished. The question of how to accurately convert generated video plans into physical execution will therefore become increasingly relevant, and our approach points toward one way to address it: online interaction with the agent's environment.


BibTeX

@misc{luo2024groundingvideomodelsactions,
      title={Grounding Video Models to Actions through Goal Conditioned Exploration}, 
      author={Yunhao Luo and Yilun Du},
      year={2024},
      eprint={2411.07223},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2411.07223}, 
}