Video Representations of Goals Emerge from Watching Failure

Columbia University
Can you guess this person's goal? Hover or tap to check your answer!
Although only the action is observable, you were probably still able to predict the goal behind this video: opening a wine bottle. In this paper, we introduce a model that learns goal-oriented video representations in the form of latent action trajectories.

We introduce a video representation learning framework that models the latent goals behind observable human action. Motivated by how children learn to reason about goals and intentions by experiencing failure, we leverage unconstrained video of unintentional action to learn without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features.

Experiments and visualizations show the model is able to predict underlying goals, detect when action switches from intentional to unintentional, and automatically correct unintentional action. Although the model is trained with minimal supervision, it is competitive with highly-supervised baselines, underscoring the role of failure examples for learning goal-oriented video representations.

Paper

arXiv · PDF
@article{epstein2020video,
  title={Video Representations of Goals Emerge from Watching Failure},
  author={Epstein, Dave and Vondrick, Carl},
  journal={arXiv preprint arXiv:2006.15657},
  year={2020}
}

Results

Goal predictions
Failure recognition

Decoding learned trajectories: We freeze our trained model and train a linear decoder to describe the goals of action, outputting subject-verb-object (SVO) triples. The decoder predicts the goal of a video when the action shown is intentional (top) and predicts unintentional failures when they appear (bottom).
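The linear probe above can be sketched in a few lines. This is a minimal illustration, not the paper's training code: the features, labels, dimensions, and the ridge-regularized least-squares fit are all stand-in assumptions for how a linear decoder over frozen trajectory features might look.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: frozen trajectory features and SVO-token labels.
n_clips, feat_dim, vocab = 200, 64, 10
feats = rng.normal(size=(n_clips, feat_dim))   # output of the frozen encoder
labels = rng.integers(0, vocab, size=n_clips)  # e.g. verb token ids

# One-hot targets; ridge-regularized least-squares linear decoder.
Y = np.eye(vocab)[labels]
lam = 1e-2
W = np.linalg.solve(feats.T @ feats + lam * np.eye(feat_dim), feats.T @ Y)

# Decode: argmax over token scores for each clip.
pred = (feats @ W).argmax(axis=1)
train_acc = (pred == labels).mean()
```

Because the encoder stays frozen, decoder accuracy measures how much goal information the representation already contains, rather than what a decoder could learn on its own.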

Action autocorrect
Input
Autocorrect retrieval

Retrieval from auto-corrected trajectories: We show the nearest neighbors of auto-corrected action trajectories, obtained with our proposed adversarial method. The retrievals are computed across both the Oops! and Kinetics datasets. The corrected trajectories are often embedded close to actions depicting the successful completion of the same high-level goal.
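The retrieval step itself is ordinary cosine nearest-neighbor search in embedding space. Below is a small sketch under assumed shapes; the embedding bank, dimensions, and the corrected query are synthetic placeholders, not the paper's data.

```python
import numpy as np

def nearest_neighbors(query, bank, k=5):
    """Return indices of the k rows of `bank` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(1)
bank = rng.normal(size=(100, 32))                 # pooled trajectory embeddings
corrected = bank[7] + 0.05 * rng.normal(size=32)  # a slightly perturbed query
idx = nearest_neighbors(corrected, bank, k=5)     # clip 7 should rank first
```

In the paper's setting, the bank would hold embeddings of both Oops! and Kinetics clips, so a corrected failure trajectory can retrieve a successful completion of the same goal.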

Analyzing the representation
Neuron 506 ⟷ fall (r=0.198)
Neuron 370 ⟷ jump (r=0.182)
Neuron 61 ⟷ flip (r=0.143)

Unsupervised action neurons emerge: We show the neurons with the highest correlation to words in the subject-verb-object vocabulary, along with their top-5 retrieved clips. Neurons that detect intentions across a wide range of actions and scenes appear to emerge, despite training only with binary labels on the intentionality of action.
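Finding such neurons amounts to computing, for each unit, the Pearson correlation between its per-clip activation and a binary indicator of whether a word appears in the clip's SVO caption. The sketch below uses synthetic activations with a signal planted in one neuron to make the procedure concrete; the neuron index and effect size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_clips, n_neurons = 500, 128
acts = rng.normal(size=(n_clips, n_neurons))  # per-clip neuron activations

# Hypothetical indicator: does the clip's SVO caption contain "fall"?
has_word = rng.random(n_clips) < 0.2
# Plant a weak signal in neuron 42 so a correlation emerges.
acts[:, 42] += 0.8 * has_word

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = x - x.mean()
    y = y - y.mean()
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

w = has_word.astype(float)
r = np.array([pearson(acts[:, j], w) for j in range(n_neurons)])
best = int(np.argmax(np.abs(r)))  # the neuron most correlated with "fall"
```

Repeating this over every vocabulary word, and keeping each word's top neuron, produces pairings like the "Neuron 506 ⟷ fall" examples shown above.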

Data

We collect unconstrained natural language descriptions of a subset of videos in the Oops! dataset, prompting Amazon Mechanical Turk workers to answer “What was the goal in this video?” as well as “What went wrong?”. We then process these sentences to extract lemmatized subject-verb-object triples, manually correcting for common constructions such as “tries to X” (where the verb lemma is detected as “try”, but we would like “X”). The final vocabulary contains 3615 tokens. The obtained SVO triples can be used to evaluate video representations of goals. We show some examples above.
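The “tries to X” correction can be illustrated with a toy extractor. This is not the paper's actual NLP pipeline (which would use a real parser and lemmatizer); the tiny lemma table and the positional subject/verb/object heuristic are hypothetical simplifications to show the rule.

```python
# Tiny stand-in lemma table; a real pipeline would use a proper lemmatizer.
LEMMAS = {"tries": "try", "opens": "open", "bottles": "bottle"}

def lemma(word):
    return LEMMAS.get(word, word)

def svo(sentence):
    """Toy SVO extractor with the 'tries to X' verb correction."""
    words = [lemma(w.strip(".").lower()) for w in sentence.split()]
    subj = words[0]
    verb_i = 1
    verb = words[verb_i]
    # Correction: "tries to X" should yield verb X, not "try".
    if verb == "try" and len(words) > verb_i + 2 and words[verb_i + 1] == "to":
        verb_i += 2
        verb = words[verb_i]
    obj = words[-1]
    return (subj, verb, obj)

svo("man tries to open the bottle")  # → ("man", "open", "bottle")
```

Without the correction, every attempted action would collapse onto the uninformative verb “try”, hiding the actual goal verb from the vocabulary.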

Explore dataset

Architecture and code

Acknowledgements

Funding was provided by DARPA MCS, NSF NRI 1925157, and an Amazon Research Gift. We thank NVIDIA for donating GPUs.