InSitu - Integrated Situated Visual Scene and Natural Language Understanding for Human Robot Interaction
Disciplines
Electrical Engineering, Electronics, Information Engineering (20%); Computer Sciences (50%); Linguistics and Literature (30%)
Keywords
Computer Vision, Cognitive Systems, Natural Language Understanding, Integration, Robotics
Recent years have seen major advances in personal and assistive robots, from household robots to robots for elder care. Yet we are still far from truly natural human-robot interaction in everyday situations and environments. Robust visual scene understanding and natural language understanding on robots are currently two of the major roadblocks. We believe this is partly because the two are often treated separately. Typical dialogue between humans situated in the same scene is full of cases where vision and dialogue are used jointly to ground a common understanding. Humans tend to look towards an object they are currently referring to in dialogue, guiding the attention of the dialogue partner. Similarly, object attributes extracted from (even partly) parsed utterances like "Could you hand me the red ..." will guide the search for the respective object. Vice versa, visually observing a scene that is being talked about supports understanding of ambiguous or underspecified utterances while they are being processed: "the red book on the floor" will most likely refer to a book visible to the speaker, not the one behind her back. Vision and natural language processing can thus mutually and incrementally constrain each other.
In this project, we will tackle the problem of tightly integrating visual scene understanding with natural language understanding. We believe that for robots to reach human-like performance in natural interactions, the vision, natural language, and action subsystems of the robotic architecture need to be tightly integrated so that they can mutually constrain each other. This, in turn, requires concurrent processing of vision, language, and actions, where all algorithms must be interruptible and able to incorporate new information incrementally on the fly. It also requires a software framework that allows seamless integration of components and algorithms at a fine temporal granularity.
By providing such a tightly integrated system, robots will be able to detect objects faster and more reliably, resolve references to perceivable and ambiguous referents more quickly, carry out intended actions sooner, and achieve much more natural dialogues with humans in everyday environments.
The InSitu project addressed the tight integration of machine vision and natural language understanding. Both are hard problems in their own right, especially in the context of autonomous robots that perform tasks in everyday environments, such as "put the yellow cup on the table on the shelf". A human given that task would, already while the sentence is being uttered, follow the gaze or gesture of the speaker and look for something yellow.
In InSitu these two subproblems were therefore treated together within a common framework. Cues from (even partially) understood utterances help image understanding by focusing analysis on the relevant parts of the image, e.g. yellow objects in the above example. The visual analysis of a complete, complex 3-dimensional scene can be quite time-consuming (many seconds even on current computers). We showed that, using situation-aware attentional mechanisms together with incremental processing methods, the detection of relevant objects in the scene can be significantly sped up. Depending on the situation and the given task, different attention channels are used, e.g. colour, shape, or position ("... the left cup ...").
Vice versa, visual cues from the scene feed back into language understanding. We showed how to learn object categories in a single shot from utterances such as "a medkit is a white box with a red cross on it" together with verbally describable visual features (cross, red). These learned classes generalize to qualitatively similar objects and support considerations such as "The cross is green: is that still a medkit?". Such semantically deep representations go beyond typical statistical methods, which learn similarity from a large number of examples without being able to explicitly explain where that similarity lies. Especially important features in this context are those that are functionally relevant, so-called affordances, like handles (to pick up) or cavities (the inside of a cup, as a container).
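The one-shot category learning described above can be illustrated with a minimal sketch. This is a hypothetical toy representation, not the project's actual code: a category is modelled simply as a set of verbally describable attribute constraints, so a single parsed utterance suffices to define the class, and a failed match can be explained by naming the violated constraint.

```python
# Toy sketch (hypothetical representation, not InSitu's implementation):
# one-shot category learning from a verbal description such as
# "a medkit is a white box with a red cross on it".

def learn_category(description):
    """Turn attribute/value pairs extracted from an utterance into constraints."""
    return dict(description)

def matches(category, observed):
    """An observed object belongs to the category iff every constraint holds."""
    return all(observed.get(attr) == val for attr, val in category.items())

def violations(category, observed):
    """Explain a mismatch: which constraints does the object violate?"""
    return [attr for attr, val in category.items() if observed.get(attr) != val]

# One utterance defines the class.
medkit = learn_category([("shape", "box"), ("body_colour", "white"),
                         ("marking", "cross"), ("marking_colour", "red")])

white_box_red_cross = {"shape": "box", "body_colour": "white",
                       "marking": "cross", "marking_colour": "red"}
white_box_green_cross = dict(white_box_red_cross, marking_colour="green")

is_medkit = matches(medkit, white_box_red_cross)
green_cross_reasons = violations(medkit, white_box_green_cross)
```

Because the representation is explicit, the system can not only reject the green-cross object but also say why: the `marking_colour` constraint is violated. This is the contrast drawn above with purely statistical similarity learning.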
In InSitu we developed a taxonomy and a systematic approach to describe everyday objects in terms of their affordances, and to detect those affordances in 3D scenes. The methods developed within InSitu were evaluated on three different robot platforms at the project partners TU Wien and Tufts University, Boston.
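The situation-aware attention channels described above can also be illustrated with a minimal sketch. The object records and the additive scoring scheme are hypothetical (the project's actual channels operated on real image data); the point is only that cues from a partially parsed utterance already rank scene regions before the sentence is complete.

```python
# Illustrative sketch (hypothetical data and scoring, not InSitu's code):
# biasing visual search with attribute cues from a partial utterance.

from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    colour: str
    x: float  # horizontal image position, 0.0 (left) .. 1.0 (right)

def attention_scores(objects, cues):
    """Rank scene objects by how well they match the linguistic cues.

    `cues` may contain a 'colour' (e.g. "yellow") and/or a coarse
    'position' ("left"/"right"); each matching channel adds to the score,
    so even "the yellow ..." alone already narrows the search.
    """
    scored = []
    for obj in objects:
        score = 0.0
        if "colour" in cues and obj.colour == cues["colour"]:
            score += 1.0
        if cues.get("position") == "left":
            score += 1.0 - obj.x  # objects further left score higher
        elif cues.get("position") == "right":
            score += obj.x
        scored.append((score, obj))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

scene = [
    SceneObject("cup", "yellow", 0.2),
    SceneObject("cup", "blue", 0.8),
    SceneObject("book", "yellow", 0.9),
]

# "put the yellow cup ..." yields a colour cue (plus a position cue here);
# the left-most yellow object is inspected before anything blue.
ranking = attention_scores(scene, {"colour": "yellow", "position": "left"})
best = ranking[0][1]
```

An incremental system would re-run such scoring as each new cue arrives, which is what makes interruptible, anytime vision algorithms necessary.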
- Technische Universität Wien - 100%
Research Output
- 192 Citations
- 28 Publications
- 2015, Journal Article: Krüger N. "What We Can Learn From the Primate’s Visual System". KI - Künstliche Intelligenz, pp. 9-18. DOI 10.1007/s13218-014-0345-9
- 2014, Conference Proceeding: Varadarajan K. "4D Space-Time Mereotopogeometry-Part Connectivity Calculus for Visual Object Representation", pp. 4316-4321. DOI 10.1109/icpr.2014.740
- 2015, Conference Proceeding: García G. "Saliency-Based Object Discovery on RGB-D Data with a Late-Fusion Approach", pp. 1866-1873. DOI 10.1109/icra.2015.7139441
- 2011, Book Chapter: Potapova E. "Learning What Matters: Combining Probabilistic Models of 2D and 3D Saliency Cues". Springer Nature, pp. 132-142. DOI 10.1007/978-3-642-23968-7_14
- 2012, Conference Proceeding: Potapova E. "Attention-driven Segmentation of Cluttered 3D Scenes". Proc. of the 21st Int. Conf. on Pattern Recognition (ICPR), Tsukuba, Japan, 2012.
- 2012, Conference Proceeding: Vincze M et al. "Web Mining Driven Object Locality Knowledge Acquisition for Efficient Robot Behavior".
- 2012, Conference Proceeding: Zillich M. "My Robot is Smarter than Your Robot - On the Need for a Total Turing Test for Robots". AISB/IACAP Symposium - Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World, Birmingham, UK, 2012.
- 2012, Conference Proceeding: Vincze M et al. "Robust Multiple Model Estimation with Jensen-Shannon Divergence". Proc. of the 21st Int. Conf. on Pattern Recognition (ICPR), Tsukuba, Japan, 2012, 4 p.
- 2012, Conference Proceeding: Zhou K. "Web Mining Driven Object Locality Knowledge Acquisition for Efficient Robot Behavior", pp. 3962-3969. DOI 10.1109/iros.2012.6385931
- 2014, Journal Article: Richtsfeld A. "Learning of perceptual grouping for object segmentation on RGB-D data". Journal of Visual Communication and Image Representation, pp. 64-73. DOI 10.1016/j.jvcir.2013.04.006
- 2014, Conference Proceeding: Potapova E. "Incremental Attention-Driven Object Segmentation", pp. 252-258. DOI 10.1109/humanoids.2014.7041368
- 2014, Conference Proceeding: Krause E. "Learning to Recognize Novel Objects in One Shot through Human-Robot Interactions in Natural Language Dialogues". Twenty-Eighth Conference on Artificial Intelligence (AAAI).
- 2014, Conference Proceeding: Potapova E. "Incremental Attention-driven Object Segmentation".
- 2013, Conference Proceeding: Krause E. "Incrementally Biasing Visual Search Using Natural Language Input". Proc. of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- 2013, Book Chapter: Prankl J. "Probabilistic Cue Integration for Real-Time Object Pose Tracking". Springer Nature, pp. 254-263. DOI 10.1007/978-3-642-39402-7_26
- 2013, Journal Article: Zhou K. "Gaussian-weighted Jensen-Shannon divergence as a robust fitness function for multi-model fitting". Machine Vision and Applications, pp. 1107-1119. DOI 10.1007/s00138-013-0513-1
- 2013, Journal Article: Mörwald T. "Advances in real-time object tracking - Extensions for robust object tracking with a Monte-Carlo particle filter".
- 2013, Conference Proceeding: Mörwald T. "Geometric data abstraction using B-splines for range image segmentation", pp. 148-153. DOI 10.1109/icra.2013.6630569
- 2013, Conference Proceeding: Zhou K. "Spatial Structure Analysis for Autonomous Robotic Vision Systems", pp. 165-170. DOI 10.1109/worv.2013.6521933
- 2013, Journal Article: Mörwald T. "Advances in real-time object tracking". Journal of Real-Time Image Processing, pp. 683-697. DOI 10.1007/s11554-013-0388-4
- 2014, Book Chapter: Zillich M. "What Vision Can, Can’t and Should Do". Springer Nature, pp. 119-131. DOI 10.1007/978-3-319-06614-1_9
- 2014, Conference Proceeding: Potapova E. "Attention-Driven Object Detection and Segmentation of Cluttered Table Scenes using 2.5D Symmetry", pp. 4946-4952. DOI 10.1109/icra.2014.6907584
- 2014, Book: Wyatt J, Petters D, Hogg D (eds.). "From Animals to Robots and Back: Reflections on Hard Problems in the Study of Cognition, A Collection in Honour of Aaron Sloman". Springer Nature. DOI 10.1007/978-3-319-06614-1
- 2013, Conference Proceeding: Potapova E. "3D Information as a Way to Improve the Quality of Attention Points". Proc. of the Austrian Robotics Workshop, Graz, 2013.
- 2013, Book Chapter: Richtsfeld A. "Anytime Perceptual Grouping of 2D Features into 3D Basic Shapes". Springer Nature, pp. 73-82. DOI 10.1007/978-3-642-39402-7_8
- 2013, Conference Proceeding: Pirri F et al. "A Pilot Study on Eye-tracking in 3D Search Tasks". Workshop on Solutions for Automatic Gaze Data Analysis (SAGA), Bielefeld, 2013.
- 2011, Conference Proceeding (poster): Potapova E. "Language-modulated attention and its tight coupling to visual processes". Rovereto Attention Workshop: Attention and Objects, 2011.
- 2013, Book Chapter: Potapova E. "Local 3D Symmetry for Visual Saliency in 2.5D Point Clouds". Springer Nature, pp. 434-445. DOI 10.1007/978-3-642-37331-2_33