Modular graph attention network for complex visual relational reasoning

Zheng, Y.; Wen, Z.; Tan, M.; Zeng, R.; Chen, Q.; Wang, Y.; Wu, Q.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/131666

Type:	Conference paper
Title:	Modular graph attention network for complex visual relational reasoning
Author:	Zheng, Y.; Wen, Z.; Tan, M.; Zeng, R.; Chen, Q.; Wang, Y.; Wu, Q.
Citation:	Lecture Notes in Artificial Intelligence, 2021, vol.12627, pp.137-153
Publisher:	Springer
Publisher Place:	Cham, Switzerland
Issue Date:	2021
Series/Report no.:	Lecture Notes in Computer Science; 12627
ISBN:	9783030695439
ISSN:	0302-9743 1611-3349
Conference Name:	Asian Conference on Computer Vision (ACCV) (30 Nov 2020 - 4 Dec 2020 : virtual online)
Statement of Responsibility:	Yihan Zheng, Zhiquan Wen, Mingkui Tan, Runhao Zeng, Qi Chen, Yaowei Wang, Qi Wu
Abstract:	Visual Relational Reasoning is crucial for many vision-and-language based tasks, such as Visual Question Answering and Vision Language Navigation. In this paper, we consider reasoning on the complex referring expression comprehension (c-REF) task, which seeks to localise the target objects in an image guided by complex queries. Such queries often contain complex logic and thus impose two key challenges for reasoning: (i) It can be very difficult to comprehend the query, since it often refers to multiple objects and describes complex relationships among them. (ii) It is non-trivial to reason among multiple objects guided by the query and localise the target correctly. To address these challenges, we propose a novel Modular Graph Attention Network (MGA-Net). Specifically, to comprehend the long queries, we devise a language attention network to decompose them into four types: basic attributes, absolute locations, visual relationships and relative locations, which mimics the human language understanding mechanism. Moreover, to capture the complex logic in a query, we construct a relational graph to represent the visual objects and their relationships, and propose a multi-step reasoning method to progressively understand the complex logic. Extensive experiments on the CLEVR-Ref+, GQA and CLEVR-CoGenT datasets demonstrate the superior reasoning performance of our MGA-Net.
Rights:	© Springer Nature Switzerland AG 2021
DOI:	10.1007/978-3-030-69544-6_9
Published version:	https://link.springer.com/book/10.1007/978-3-030-69544-6
Appears in Collections:	Aurora harvest 8 Computer Science publications

Adelaide Research & Scholarship