Popular Downstream Tasks for Video Representation Learning

A summary of the common downstream tasks used to evaluate video representation learning for long, instructional videos.

What is Representation Learning?

Representation learning is an area of research that focuses on how to learn compact, numerical representations for different sources of signal. These signals are most often video, text, audio, and images. The goal of this research is to use these representations for other tasks, such as querying for information. A well-known example is searching for videos on YouTube: a user provides text keywords, and YouTube returns the set of videos most similar to those words.
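To make the querying idea concrete, here is a minimal sketch of ranking videos against a text query by cosine similarity. It assumes the text and the videos have already been embedded into the same vector space. The embedding values, the video names, and the helper functions `cosine_similarity` and `rank_videos` are all made up for illustration; this is not how YouTube or any particular paper implements search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
video_embeddings = {
    "knead_bread_dough": [0.9, 0.1, 0.0],
    "repot_a_plant": [0.1, 0.9, 0.1],
    "beginner_yoga_flow": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the text query "how to bake bread".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_videos(query_embedding, video_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The video whose embedding points in nearly the same direction as the query comes out on top, which is the behavior a good learned representation is meant to provide.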

In the computer vision literature, representations are learned by training a deep learning model to embed (or transform) raw input into a numerical vector. When embedding videos, audio, or text, the numerical vectors are often multi-dimensional to maintain temporal relationships. The ways in which researchers train these models vary drastically. Downstream tasks are how these methods are evaluated, and they are the focus of this article.
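As a rough picture of what "multi-dimensional to maintain temporal relationships" can look like, the sketch below stores a video as one vector per clip, in time order, and then averages over the time axis to get a single video-level vector. Mean pooling is only one common choice and is an assumption here, as are the toy values and the helper name `mean_pool`; individual papers handle the temporal axis in different ways.

```python
def mean_pool(clip_embeddings):
    """Average a time-ordered list of clip vectors into one video vector."""
    num_clips = len(clip_embeddings)
    dim = len(clip_embeddings[0])
    return [
        sum(clip[d] for clip in clip_embeddings) / num_clips
        for d in range(dim)
    ]


# A hypothetical video embedded as 4 clips x 2 dimensions (time x features).
# Keeping the time axis preserves the order in which things happen.
video = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
]

# Pooling discards the order but yields one compact vector per video.
pooled = mean_pool(video)
print(pooled)
```

Tasks that need to know when something happens would work with the per-clip sequence, while tasks that label a whole video can often use the pooled vector.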


Many of the papers that I have read use the HowTo100M dataset to train the model. This corpus contains a total of 1.2 million videos covering 23k activities! The types of activities are diverse, ranging from cooking to crafting to fitness to gardening, and much more. This dataset is HUGE, and training a model on it would take the average researcher a long time. However, if you have the computational power, it is a great dataset to use for representation learning. Each video comes with a short description of the task and a set of automatically generated captions. From my experience, and the experience of the original researchers, the captions are very noisy: they have alignment issues coupled with inaccurate speech-to-text transcriptions. The short task descriptions are not ALWAYS accurate, or are extremely general. However, this is a result of the YouTube extraction, not the fault of the researchers.

For each downstream task, there are datasets with annotations specific to that task’s evaluation. These datasets have a smaller set of videos focused on a smaller set of activities.


Datasets that are great for text-related video tasks are:

YouCook2 is a cooking-based dataset of 2K untrimmed videos of 89 cooking recipes with step-by-step annotations. This dataset also includes temporal boundaries that are useful for temporally targeted tasks. MSR-VTT is a more general dataset of 10K video clips with 200K human-annotated clip-caption pairs on 257 different topics. While not specifically instructional, the LSMDC dataset contains 101k unique video clip-caption pairs, with descriptions coming from either the movie script or the audio description.

Datasets more commonly used for action-related video tasks are:

CrossTask contains 2.7k instructional videos with step-by-step action annotations for each frame. COIN is also a general instructional video dataset, with 11,827 videos covering 180 tasks and annotations for the actions that occur in each video.

Downstream Tasks

Video/Action Classification

