Space-Time Representation and Recognition in Computer Vision
Disciplines
Computer Sciences (100%)
Keywords
- Computer Vision
- Space-Time Representation
- Dynamic Scene Recognition
- Action/Activity Recognition
- Space-time object category model
This project shall investigate spatial and temporal relations in video in a principled manner. We will develop a novel representation, the "space-time volume of interest" (VOI), that will be used consistently to represent various entities in video data, for instance independent foreground motion, background motion patterns, and repetitive patterns in space-time. These entities will be described by extended space-time oriented energy filters, and the descriptions will be used to categorize video textures, dynamic scenes, types of camera motion, independently moving foreground objects, and activities. The basic representational units, i.e., individual VOIs, shall be integrated into complex, compound space-time models that cast votes for particular spatial and temporal locations as well as scales. The project is coarsely structured into four work packages, addressing the VOI representation itself, the extraction of local space-time descriptors, learning and long-term space-time description, and online detection/categorization in video. With this research project, we expect to contribute to a better representation and understanding of the "deep structure" of visual space-time. The novel concepts developed in this project will initiate further basic research in this field and encourage new applications in video analysis, surveillance, and autonomous systems.
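To illustrate the kind of measurement an oriented energy filter produces over a video volume, the following minimal Python sketch steers a first-order 3D Gaussian-derivative filter along a chosen space-time direction and rectifies its response. This is a simplification made for illustration: the project's filters are richer space-time oriented energy filter banks, so the filter choice, axis layout, and parameters here are assumptions, not the project's implementation.

```python
import numpy as np
from scipy import ndimage

def oriented_energy(volume, direction, sigma=2.0):
    """Rectified response of a first-order Gaussian-derivative filter
    steered along the space-time direction (dt, dy, dx).
    `volume` is a (T, H, W) grey-level video array."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)                       # unit steering direction
    # Gaussian-derivative responses along the t, y and x axes.
    g_t = ndimage.gaussian_filter(volume, sigma, order=(1, 0, 0))
    g_y = ndimage.gaussian_filter(volume, sigma, order=(0, 1, 0))
    g_x = ndimage.gaussian_filter(volume, sigma, order=(0, 0, 1))
    # Steerability of first derivatives: the directional derivative is the
    # dot product of the gradient with the unit direction; squaring gives an
    # oriented energy measurement at every space-time position.
    steered = d[0] * g_t + d[1] * g_y + d[2] * g_x
    return steered ** 2

# Toy usage: energy tuned to static spatial structure vs. horizontal motion.
video = np.random.rand(16, 64, 64).astype(np.float32)   # (T, H, W) volume
e_static = oriented_energy(video, (0.0, 0.0, 1.0))       # purely spatial orientation
e_motion = oriented_energy(video, (1.0, 0.0, 1.0))       # diagonal in x-t plane
```

Aggregating such responses over many orientations and scales within a VOI yields the kind of descriptor that can be compared and voted on across space-time locations.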
What happens where and when in a video? How can objects and actions be tracked, represented, and classified? What are the major applications of such novel methods? The project addressed these key questions from August 2014 to July 2017. In these three years, the field of Computer Vision has seen a paradigm shift from previously used explicit modeling and specific algorithms to now popular implicit representations in deep convolutional neural networks (deep ConvNets). These new methods largely outperform all previous algorithms, with a general increase in system performance and reliability that puts significant applications within reach. This paradigm shift includes a shift in research focus towards deep network architectures, machine learning, and image and video databases for training.

We achieved several remarkable results, in close collaboration with York University, Toronto (explicit representation of space-time using space-time oriented energies) and with the University of Oxford (implicit representation in ConvNets, in particular novel two-stream architectures that model and combine appearance and motion in video). Our results include the release of a new benchmark video dataset for dynamic scene recognition and a number of novel ConvNet architectures, which we trained to establish a new state of the art in dynamic scene recognition, human action recognition, and object detection and tracking in video.

The focus of this basic research project has mainly been on the development of novel methods, and we succeeded in publishing an outstanding number of papers in top-ranked Computer Vision conferences and journals. Beyond these purely scientific achievements, there are a number of highly relevant applications of our methods in areas such as autonomous driving, automated video analysis and annotation, and video surveillance. Towards the end of the project, we also achieved a major breakthrough towards a better understanding and analysis of what happens inside deep neural networks. For the first time, we can visualize the learned representations of the motion stream in ConvNets, which shows what has been learned and how it is represented in which units at which layers of the network. This also enables an improved analysis of problems and failure cases.
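To make the two-stream idea concrete, below is a minimal, hypothetical PyTorch sketch of a network with an appearance (RGB) stream and a motion (optical-flow) stream whose convolutional feature maps are fused by channel concatenation. It is only a sketch of the general principle; the layer sizes, fusion point, and input format are illustrative assumptions and do not reproduce the published architectures.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Toy two-stream network: separate conv towers for appearance and
    motion, fused by concatenating their feature maps."""
    def __init__(self, num_classes=101):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.rgb = stream(3)         # appearance stream: one RGB frame
        self.flow = stream(10)       # motion stream: stack of 5 flow fields (x, y)
        self.fuse = nn.Sequential(   # fuse by channel concatenation + 1x1 conv
            nn.Conv2d(128, 128, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, rgb, flow):
        fused = torch.cat([self.rgb(rgb), self.flow(flow)], dim=1)
        return self.fuse(fused)

# Usage: a batch of single 112x112 RGB frames plus 5-frame optical-flow stacks.
model = TwoStreamFusion()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 10, 112, 112))
```

The essential design question in such architectures is where and how the two streams interact (e.g., concatenation, summation, or multiplicative gating, and at which layer), which is exactly the axis explored by the fusion, residual, and multiplier networks listed in the research output below.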
- Technische Universität Graz - 100%
- Richard Wildes, York University - Canada
Research Output
- 3888 Citations
- 18 Publications
- 2019, Journal Article: Feichtenhofer C, "Deep Insights into Convolutional Networks for Video Recognition", International Journal of Computer Vision, pp. 420-437. DOI 10.1007/s11263-019-01225-w
- 2018, Preprint: Feichtenhofer C, "What have we learned from deep representations for action recognition?". DOI 10.48550/arxiv.1801.01415
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Spatiotemporal Multiplier Networks for Video Action Recognition", pp. 7445-7454. DOI 10.1109/cvpr.2017.787
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Temporal Residual Networks for Dynamic Scene Recognition", pp. 7435-7444. DOI 10.1109/cvpr.2017.786
- 2016, Preprint: Feichtenhofer C, "Spatiotemporal Residual Networks for Video Action Recognition". DOI 10.48550/arxiv.1611.02155
- 2016, Preprint: Feichtenhofer C, "Convolutional Two-Stream Network Fusion for Video Action Recognition". DOI 10.48550/arxiv.1604.06573
- 2016, Conference Proceeding Abstract: Feichtenhofer C, "Spatiotemporal Residual Networks for Video Action Recognition", Proc. NIPS, 2016
- 2018, Conference Proceeding Abstract: Feichtenhofer C, "What have we learned from deep representations for action recognition?", pp. 7844-7853. DOI 10.1109/cvpr.2018.00818
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Detect to Track and Track to Detect", pp. 3057-3065. DOI 10.1109/iccv.2017.330
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Detect to Track and Track to Detect", Proc. ICCV, 2017
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Temporal Residual Networks for Dynamic Scene Recognition", Proc. CVPR, 2017
- 2017, Conference Proceeding Abstract: Feichtenhofer C, "Spatiotemporal Multiplier Networks for Video Action Recognition", Proc. CVPR, 2017
- 2017, Preprint: Feichtenhofer C, "Detect to Track and Track to Detect". DOI 10.48550/arxiv.1710.03958
- 2016, Journal Article: Feichtenhofer C, "Dynamic Scene Recognition with Complementary Spatiotemporal Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2389-2401. DOI 10.1109/tpami.2016.2526008
- 2016, Conference Proceeding Abstract: Feichtenhofer C, "Convolutional Two-Stream Network Fusion for Video Action Recognition", pp. 1933-1941. DOI 10.1109/cvpr.2016.213
- 2016, Conference Proceeding Abstract: Feichtenhofer C, "Convolutional Two-Stream Network Fusion for Video Action Recognition", Proc. CVPR, 2016
- 2015, Conference Proceeding Abstract: Feichtenhofer C, "Dynamically Encoded Actions Based on Spacetime Saliency", pp. 2755-2764. DOI 10.1109/cvpr.2015.7298892
- 2015, Conference Proceeding Abstract: Feichtenhofer C, "Dynamically Encoded Actions based on Spacetime Saliency", Proc. CVPR, 2015