Learning based temporal action localization methods require vast amounts of training data. However, such largescale video datasets, which are expected to capture the dynamics of every action category, are not only very expensive to acquire but are …
We address the problem of detailed sequence labeling of complex activities in videos, which aims to assign an action label to every frame. Previous work typically focus on predicting action class labels for each frame in a sequence without reasoning …