Learning Goals from Failure

Columbia University


[Interactive video: can you guess this person's goal?]

Although only the action is observable, you were probably still able to predict the goal behind this video: opening a wine bottle. In this paper, we introduce a model that learns goal-oriented video representations in the form of latent action trajectories.

We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features.

Experiments and visualizations show that our trained model can predict the underlying goals in video of unintentional action. We also propose a method to "automatically correct" unintentional action by using gradient signals from our model to adjust latent trajectories. Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large supervised datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video.
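The idea of adjusting a latent representation with gradient signals can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the "model" is a toy logistic scorer over a single latent vector, and `failure_probability`, `autocorrect`, the weights, and the step size are hypothetical names and values, not the architecture or procedure used in the paper.

```python
import math

def failure_probability(z, w, b):
    """Toy stand-in for a trained model: probability that latent z depicts a failure."""
    s = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-s))

def autocorrect(z, w, b, step=1.0, n_steps=50):
    """Move z along the negative gradient of the failure probability."""
    z = list(z)
    for _ in range(n_steps):
        p = failure_probability(z, w, b)
        # For a logistic scorer, d p / d z_i = p * (1 - p) * w_i.
        grad = [p * (1.0 - p) * wi for wi in w]
        z = [zi - step * gi for zi, gi in zip(z, grad)]
    return z

# Hypothetical weights and latent vector, chosen only for the demo.
w, b = [1.0, -2.0, 0.5], 0.0
z0 = [1.0, -0.5, -1.0]
z1 = autocorrect(z0, w, b)
p0 = failure_probability(z0, w, b)
p1 = failure_probability(z1, w, b)
```

In this toy setting the adjusted vector `z1` is scored as less likely to depict a failure than `z0`. A real model would obtain the gradient by backpropagating through the network rather than from a closed-form expression.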

Paper


@article{epstein2020video,
  title={Learning Goals from Failure},
  author={Epstein, Dave and Vondrick, Carl},
  journal={arXiv preprint arXiv:2006.15657},
  year={2020}
}

Results

Goal predictions

Failure recognition

Decoding learned trajectories: We freeze our trained model and train a linear decoder to describe the goals of action, outputting subject-verb-object (SVO) triples. The decoder predicts the goals of video when the action shown is intentional (left or top panel) and predicts unintentional failures when they appear (right or bottom panel).
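A linear decoder on frozen features can be sketched as plain softmax regression. This is a minimal illustration under stated assumptions: the feature vectors, the two-verb label set, and the helper names `train_linear_probe` and `predict` are hypothetical, and the sketch decodes a single slot (the verb) rather than a full SVO triple.

```python
import math

def train_linear_probe(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit softmax regression with SGD; the input features are never updated."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = [sum(wk * xk for wk, xk in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - m) for l in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit c is (prob - one_hot).
                err = probs[c] - (1.0 if c == y else 0.0)
                for k in range(dim):
                    W[c][k] -= lr * err * x[k]
                b[c] -= lr * err
    return W, b

def predict(W, b, x):
    """Return the index of the highest-scoring class for feature vector x."""
    logits = [sum(wk * xk for wk, xk in zip(W[c], x)) + b[c] for c in range(len(W))]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical frozen features and verb labels (0 = "jump", 1 = "fall").
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
verb_ids = [0, 0, 1, 1]
W, b = train_linear_probe(frozen_features, verb_ids, n_classes=2)
predictions = [predict(W, b, x) for x in frozen_features]
```

Because only `W` and `b` are trained, the decoder's accuracy reflects what is linearly readable from the representation itself.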

Action autocorrect

[Figure: three examples, each pairing an input clip with its autocorrect retrieval.]

Retrieval from auto-corrected trajectories: We show the nearest neighbors of action trajectories auto-corrected with our proposed adversarial method. Retrievals are computed across both the Oops! and Kinetics datasets. The corrected trajectories are often embedded close to clips depicting the successful completion of the same high-level goal.
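Nearest-neighbor retrieval over embeddings can be sketched as follows. The similarity measure is an assumption: cosine similarity is used here because it is a common choice, and the page does not state which metric the authors use. The clip names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, gallery, k=2):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Hypothetical gallery of clip embeddings pooled from several datasets.
gallery = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.8, 0.6, 0.0],
    "clip_c": [0.0, 0.0, 1.0],
}
corrected = [0.9, 0.1, 0.0]  # stand-in for an auto-corrected representation
top = nearest_neighbors(corrected, gallery, k=2)
```

For this toy gallery, `top` lists `clip_a` first and `clip_b` second.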

Analyzing the representation

Neuron 506 ⟷ fall (r=0.198)

Neuron 370 ⟷ jump (r=0.182)

Neuron 61 ⟷ flip (r=0.143)

Unsupervised action neurons emerge: We show the neurons with the highest correlation to words in the subject-verb-object vocabulary, along with their top-5 retrieved clips. Neurons that detect intentions across a wide range of actions and scenes appear to emerge, despite training only with binary labels on the intentionality of action.
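One way to find the neuron most correlated with a word is to compute a correlation coefficient between each neuron's activation and a binary indicator of whether the word appears in a clip's annotation. The sketch below assumes the reported r is a Pearson correlation, which the page does not confirm; the activations, indicator values, and function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_neuron_for_word(activations, word_present):
    """activations[i][j] is the activation of neuron j on clip i."""
    n_neurons = len(activations[0])
    scores = [pearson_r([row[j] for row in activations], word_present)
              for j in range(n_neurons)]
    best = max(range(n_neurons), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical activations for four clips and two neurons.
activations = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
has_fall = [0, 0, 1, 1]  # 1 if the clip's annotation contains "fall"
neuron, r = best_neuron_for_word(activations, has_fall)
```

Ranking clips by the selected neuron's activation would then give that neuron's top retrievals.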

Data

Goal: Two men were trying to put turkey in a vat of hot oil.
Failure: The men dropped the turkey aggressively and oil overflowed onto the grass.

Goal: A man was trying to cross the road on green as a car was coming towards him.
Failure: fumbled his shopping and his crate emptied out all over the sidewalk.

Goal: Someone is trying to carry grill down the patio stairs.
Failure: The guy lost control of the grill since it was too heavy and the wheels slipped.

Goal: The woman just wanted to sled down the snowy hill.
Failure: The dog was bothering her the entire time and just wanted to play.

We collect unconstrained natural language descriptions of a subset of videos in the Oops! dataset, prompting Amazon Mechanical Turk workers to answer "What was the goal in this video?" as well as "What went wrong?". We then process these sentences to detect lemmatized subject-verb-object triples, manually correcting for common constructions such as "tries to X" (where the verb lemma is detected as "try", but we would like "X"). The final vocabulary contains 3615 tokens. The obtained SVO triples can be used to evaluate video representations of goals. Some examples are shown above.
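The "tries to X" correction can be sketched as a small rule applied to a lemmatized sentence. This is an illustration only: it assumes lemmas and the detected verb position are already available from some upstream parser, and the `CONTROL_VERBS` set and `correct_verb` helper are hypothetical rather than the authors' actual rules.

```python
# Assumed set of verbs that introduce the real goal verb via "to".
CONTROL_VERBS = {"try", "attempt", "want"}

def correct_verb(lemmas, verb_index):
    """Return the index of the verb that carries the goal.

    If the detected verb is a control verb followed by "to", skip ahead
    to the verb after "to"; otherwise keep the detected verb.
    """
    if (lemmas[verb_index] in CONTROL_VERBS
            and verb_index + 2 < len(lemmas)
            and lemmas[verb_index + 1] == "to"):
        return verb_index + 2
    return verb_index

# Lemmatized content words of "A man was trying to cross the road".
lemmas = ["man", "try", "to", "cross", "road"]
verb = lemmas[correct_verb(lemmas, 1)]
triple = ("man", verb, "road")
```

Here the detected verb "try" is replaced by "cross", giving the triple (man, cross, road). A production pipeline would work on a dependency parse rather than fixed token offsets.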


Architecture and code

GitHub

Acknowledgements

Funding was provided by DARPA MCS, NSF NRI 1925157, and an Amazon Research Gift. We thank NVIDIA for donating GPUs.