InSitu - Integrated Situated Visual Scene and Natural Language Understanding for Human Robot Interaction
Disciplines
Electrical Engineering, Electronics, Information Engineering (20%); Computer Sciences (50%); Linguistics and Literature (30%)
Keywords
Computer Vision, Cognitive Systems, Natural Language Understanding, Integration, Robotics
Recent years have seen major advances in personal and assistive robots, from household robots to robots for elder care. Yet we are still far from truly natural human-robot interaction in everyday situations and environments. Robust visual scene understanding and natural language understanding on robots are currently two of the major roadblocks. We believe this is partly because the two are often treated separately. Typical dialogue between humans situated in the same scene is full of cases where vision and dialogue are used jointly to ground a common understanding. Humans tend to look towards an object they are currently referring to in dialogue, guiding the attention of the dialogue partner. Similarly, object attributes extracted from (even partly) parsed utterances like "Could you hand me the red ..." will guide the search for the respective object. Vice versa, visually observing a scene that is being talked about supports understanding of ambiguous or underspecified utterances while they are being processed: "the red book on the floor" will most likely refer to a book visible to the speaker, not the one behind her back. Vision and natural language processing can thus mutually and incrementally constrain each other.
In this project, we will tackle the problem of tightly integrating visual scene understanding with natural language understanding. We believe that for robots to reach human-like performance in natural interactions, the vision, natural language, and action subsystems of the robotic architecture need to be tightly integrated so that they can mutually constrain each other. This, in turn, requires concurrent processing of vision, language, and actions, where all algorithms must be interruptible and able to incorporate new information incrementally on the fly. It also requires a software framework that allows seamless integration of components and algorithms at a fine temporal granularity.
By providing such a tightly integrated system, robots will be able to detect objects faster and more reliably, resolve references to perceivable and ambiguous referents more quickly, carry out intended actions sooner, and achieve much more natural dialogues with humans in everyday environments.
The InSitu project addressed the tight integration of machine vision and natural language understanding. Both are hard problems in their own right, especially in the context of autonomous robots that perform tasks in everyday environments, such as "put the yellow cup on the table on the shelf". A human given that task would, already while the sentence is being uttered, follow the gaze or gesture of the speaker and look for something yellow.
In InSitu these two subproblems were therefore treated together within a common framework. Cues from (even partially) understood utterances help image understanding by focusing analysis on the relevant parts of the image, e.g. yellow objects in the above example. The visual analysis of a complete, complex 3-dimensional scene can be quite time-consuming (many seconds even on current computers). We showed that, using situation-aware attentional mechanisms together with incremental processing methods, the detection of relevant objects in the scene can be significantly sped up. Depending on the situation and the given task, different attention channels are used, e.g. colour, shape, or position ("... the left cup ...").
Vice versa, visual cues from the scene feed back into language understanding. We showed how to learn object categories in a single shot from utterances such as "a medkit is a white box with a red cross on it" together with verbally describable visual features (cross, red). These learned classes generalize to qualitatively similar objects and support considerations such as "The cross is green: is that still a medkit?". Such semantically deep representations go beyond typical statistical methods, which learn similarity from a large number of examples without being able to explicitly explain where that similarity lies. Especially important features in this context are those that are functionally relevant, so-called affordances, like handles (to pick up) or cavities (the inside of a cup, as a container).
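The one-shot category learning described above can be illustrated with a minimal sketch. This is a hypothetical toy representation, not the project's actual code: a category is modelled simply as a set of verbally describable attribute constraints, so a single parsed utterance suffices to define the class, and a failed match can be explained by naming the violated constraint.

```python
# Toy sketch (hypothetical representation, not InSitu's implementation):
# one-shot category learning from a verbal description such as
# "a medkit is a white box with a red cross on it".

def learn_category(description):
    """Turn attribute/value pairs extracted from an utterance into constraints."""
    return dict(description)

def matches(category, observed):
    """An observed object belongs to the category iff every constraint holds."""
    return all(observed.get(attr) == val for attr, val in category.items())

def violations(category, observed):
    """Explain a mismatch: which constraints does the object violate?"""
    return [attr for attr, val in category.items() if observed.get(attr) != val]

# One utterance defines the class.
medkit = learn_category([("shape", "box"), ("body_colour", "white"),
                         ("marking", "cross"), ("marking_colour", "red")])

white_box_red_cross = {"shape": "box", "body_colour": "white",
                       "marking": "cross", "marking_colour": "red"}
white_box_green_cross = dict(white_box_red_cross, marking_colour="green")

is_medkit = matches(medkit, white_box_red_cross)
green_cross_reasons = violations(medkit, white_box_green_cross)
```

Because the representation is explicit, the system can not only reject the green-cross object but also say why: the `marking_colour` constraint is violated. This is the contrast drawn above with purely statistical similarity learning.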
In InSitu we developed a taxonomy and a systematic approach to describe everyday objects in terms of their affordances, and to detect those affordances in 3D scenes. The methods developed within InSitu were evaluated on three different robot platforms at the project partners TU Wien and Tufts University, Boston.
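The situation-aware attention channels described above can also be illustrated with a minimal sketch. The object records and the additive scoring scheme are hypothetical (the project's actual channels operated on real image data); the point is only that cues from a partially parsed utterance already rank scene regions before the sentence is complete.

```python
# Illustrative sketch (hypothetical data and scoring, not InSitu's code):
# biasing visual search with attribute cues from a partial utterance.

from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    colour: str
    x: float  # horizontal image position, 0.0 (left) .. 1.0 (right)

def attention_scores(objects, cues):
    """Rank scene objects by how well they match the linguistic cues.

    `cues` may contain a 'colour' (e.g. "yellow") and/or a coarse
    'position' ("left"/"right"); each matching channel adds to the score,
    so even "the yellow ..." alone already narrows the search.
    """
    scored = []
    for obj in objects:
        score = 0.0
        if "colour" in cues and obj.colour == cues["colour"]:
            score += 1.0
        if cues.get("position") == "left":
            score += 1.0 - obj.x  # objects further left score higher
        elif cues.get("position") == "right":
            score += obj.x
        scored.append((score, obj))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

scene = [
    SceneObject("cup", "yellow", 0.2),
    SceneObject("cup", "blue", 0.8),
    SceneObject("book", "yellow", 0.9),
]

# "put the yellow cup ..." yields a colour cue (plus a position cue here);
# the left-most yellow object is inspected before anything blue.
ranking = attention_scores(scene, {"colour": "yellow", "position": "left"})
best = ranking[0][1]
```

An incremental system would re-run such scoring as each new cue arrives, which is what makes interruptible, anytime vision algorithms necessary.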
- Technische Universität Wien - 100%
Research Output
- 192 Citations
- 28 Publications
- 2015, Journal Article: Krüger N. "What We Can Learn From the Primate’s Visual System". KI - Künstliche Intelligenz, pp. 9-18. DOI 10.1007/s13218-014-0345-9
- 2014, Conference Proceeding: Varadarajan K. "4D Space-Time Mereotopogeometry-Part Connectivity Calculus for Visual Object Representation", pp. 4316-4321. DOI 10.1109/icpr.2014.740
- 2015, Conference Proceeding: García G. "Saliency-Based Object Discovery on RGB-D Data with a Late-Fusion Approach", pp. 1866-1873. DOI 10.1109/icra.2015.7139441
- 2011, Book Chapter: Potapova E. "Learning What Matters: Combining Probabilistic Models of 2D and 3D Saliency Cues". Springer Nature, pp. 132-142. DOI 10.1007/978-3-642-23968-7_14
- 2012, Conference Proceeding: Potapova E. "Attention-driven Segmentation of Cluttered 3D Scenes". Proc. of the 21st Int. Conf. on Pattern Recognition (ICPR), Tsukuba, Japan, 2012.
- 2012, Conference Proceeding: Vincze M et al. "Web Mining Driven Object Locality Knowledge Acquisition for Efficient Robot Behavior".
- 2012, Conference Proceeding: Zillich M. "My Robot is Smarter than Your Robot - On the Need for a Total Turing Test for Robots". AISB/IACAP Symposium - Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World, Birmingham, UK, 2012.
- 2012, Conference Proceeding: Vincze M et al. "Robust Multiple Model Estimation with Jensen-Shannon Divergence". Proc. of the 21st Int. Conf. on Pattern Recognition (ICPR), Tsukuba, Japan, 2012, 4 p.
- 2012, Conference Proceeding: Zhou K. "Web Mining Driven Object Locality Knowledge Acquisition for Efficient Robot Behavior", pp. 3962-3969. DOI 10.1109/iros.2012.6385931
- 2014, Journal Article: Richtsfeld A. "Learning of perceptual grouping for object segmentation on RGB-D data". Journal of Visual Communication and Image Representation, pp. 64-73. DOI 10.1016/j.jvcir.2013.04.006
- 2014, Conference Proceeding: Potapova E. "Incremental Attention-Driven Object Segmentation", pp. 252-258. DOI 10.1109/humanoids.2014.7041368
- 2014, Conference Proceeding: Krause E. "Learning to Recognize Novel Objects in One Shot through Human-Robot Interactions in Natural Language Dialogues". Twenty-Eighth Conference on Artificial Intelligence (AAAI).
- 2014, Conference Proceeding: Potapova E. "Incremental Attention-driven Object Segmentation".
- 2013, Conference Proceeding: Krause E. "Incrementally Biasing Visual Search Using Natural Language Input". Proc. of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- 2013, Book Chapter: Prankl J. "Probabilistic Cue Integration for Real-Time Object Pose Tracking". Springer Nature, pp. 254-263. DOI 10.1007/978-3-642-39402-7_26
- 2013, Journal Article: Zhou K. "Gaussian-weighted Jensen-Shannon divergence as a robust fitness function for multi-model fitting". Machine Vision and Applications, pp. 1107-1119. DOI 10.1007/s00138-013-0513-1
- 2013, Journal Article: Mörwald T. "Advances in real-time object tracking - Extensions for robust object tracking with a Monte-Carlo particle filter".
- 2013, Conference Proceeding: Mörwald T. "Geometric data abstraction using B-splines for range image segmentation", pp. 148-153. DOI 10.1109/icra.2013.6630569
- 2013, Conference Proceeding: Zhou K. "Spatial Structure Analysis for Autonomous Robotic Vision Systems", pp. 165-170. DOI 10.1109/worv.2013.6521933
- 2013, Journal Article: Mörwald T. "Advances in real-time object tracking". Journal of Real-Time Image Processing, pp. 683-697. DOI 10.1007/s11554-013-0388-4
- 2014, Book Chapter: Zillich M. "What Vision Can, Can’t and Should Do". Springer Nature, pp. 119-131. DOI 10.1007/978-3-319-06614-1_9
- 2014, Conference Proceeding: Potapova E. "Attention-Driven Object Detection and Segmentation of Cluttered Table Scenes using 2.5D Symmetry", pp. 4946-4952. DOI 10.1109/icra.2014.6907584
- 2014, Book: Wyatt J, Petters D, Hogg D (eds.). "From Animals to Robots and Back: Reflections on Hard Problems in the Study of Cognition, A Collection in Honour of Aaron Sloman". Springer Nature. DOI 10.1007/978-3-319-06614-1
- 2013, Conference Proceeding: Potapova E. "3D Information as a Way to Improve the Quality of Attention Points". Proc. of the Austrian Robotics Workshop, Graz, 2013.
- 2013, Book Chapter: Richtsfeld A. "Anytime Perceptual Grouping of 2D Features into 3D Basic Shapes". Springer Nature, pp. 73-82. DOI 10.1007/978-3-642-39402-7_8
- 2013, Conference Proceeding: Pirri F et al. "A Pilot Study on Eye-tracking in 3D Search Tasks". Workshop on Solutions for Automatic Gaze Data Analysis (SAGA), Bielefeld, 2013.
- 2011, Conference Proceeding (poster): Potapova E. "Language-modulated attention and its tight coupling to visual processes". Rovereto Attention Workshop: Attention and Objects, 2011.
- 2013, Book Chapter: Potapova E. "Local 3D Symmetry for Visual Saliency in 2.5D Point Clouds". Springer Nature, pp. 434-445. DOI 10.1007/978-3-642-37331-2_33