HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation

Qiao, Y.; Qi, Y.; Hong, Y.; Yu, Z.; Wang, P.; Wu, Q.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/139132

Full metadata record

DC Field	Value	Language
dc.contributor.author	Qiao, Y.	-
dc.contributor.author	Qi, Y.	-
dc.contributor.author	Hong, Y.	-
dc.contributor.author	Yu, Z.	-
dc.contributor.author	Wang, P.	-
dc.contributor.author	Wu, Q.	-
dc.date.issued	2023	-
dc.identifier.citation	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023; 45(7):8524-8537	-
dc.identifier.issn	0162-8828	-
dc.identifier.issn	2160-9292	-
dc.identifier.uri	https://hdl.handle.net/2440/139132	-
dc.description.abstract	Recent works attempt to employ pre-training in Vision-and-Language Navigation (VLN). However, these methods neglect the importance of historical contexts or ignore predicting future actions during pre-training, limiting the learning of visual-textual correspondence and the capability of decision-making. To address these problems, we present a history-enhanced and order-aware pre-training with the complementing fine-tuning paradigm (HOP+) for VLN. Specifically, besides the common Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM) tasks, we design three novel VLN-specific proxy tasks: Action Prediction with History (APH) task, Trajectory Order Modeling (TOM) task and Group Order Modeling (GOM) task. APH task takes into account the visual perception trajectory to enhance the learning of historical knowledge as well as action prediction. The two temporal visual-textual alignment tasks, TOM and GOM, further improve the agent’s ability to order reasoning. Moreover, we design a memory network to address the representation inconsistency of history context between the pre-training and the fine-tuning stages. The memory network effectively selects and summarizes historical information for action prediction during fine-tuning, without costing huge extra computation consumption for downstream VLN tasks. HOP+ achieves new state-of-the-art performance on four downstream VLN tasks (R2R, REVERIE, RxR, and NDH), which demonstrates the effectiveness of our proposed method.	-
dc.description.statementofresponsibility	Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu	-
dc.language.iso	en	-
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)	-
dc.rights	© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.	-
dc.source.uri	http://dx.doi.org/10.1109/tpami.2023.3234243	-
dc.subject	Vision-and-language navigation; pre-training; memory networks	-
dc.title	HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation	-
dc.type	Journal article	-
dc.identifier.doi	10.1109/tpami.2023.3234243	-
dc.relation.grant	http://purl.org/au-research/grants/arc/DE190100539	-
pubs.publication-status	Published	-
dc.identifier.orcid	Qiao, Y. [0000-0002-5606-0702]	-
dc.identifier.orcid	Wu, Q. [0000-0003-3631-256X]	-
Appears in Collections:	Australian Institute for Machine Learning publications

Files in This Item:

There are no files associated with this item.

Adelaide Research & Scholarship