Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/140132
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Wu, Qi | -
dc.contributor.advisor | Qi, Yuankai | -
dc.contributor.author | Qiao, Yanyuan | -
dc.date.issued | 2023 | -
dc.identifier.uri | https://hdl.handle.net/2440/140132 | -
dc.description.abstract | The field of Vision and Language has attracted significant interest and holds tremendous potential for real-world applications, particularly in the area of Vision-and-Language Navigation (VLN). The VLN task enables robots to understand navigation instructions expressed in natural language, perceive the environment, and execute corresponding actions, making it applicable in various scenarios such as home assistants. Despite considerable progress in advancing the development of VLN, several challenges persist and warrant further attention. These challenges include the lack of pre-training models that emphasize temporal information specific to VLN, the necessity for parameter-efficient transfer learning techniques to effectively utilize pre-trained models, and the exploration of Large Language Models (LLMs) to leverage their extensive knowledge for enhanced performance in VLN. In this thesis, we propose a series of new methods to address these challenges. First, we introduce a history-enhanced and order-aware pre-training and fine-tuning paradigm for VLN. We design three VLN-specific proxy tasks: the Action Prediction with History (APH) task, the Trajectory Order Modeling (TOM) task, and the Group Order Modeling (GOM) task. Furthermore, we develop a memory network to address the representation inconsistency of history context between the pre-training and fine-tuning stages. Second, we present the first study exploring Parameter-Efficient Transfer Learning (PETL) methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: the Historical Interaction Booster (HIB) and the Cross-modal Interaction Booster (CIB), which are integrated with existing PETL methods such as Adapter and LoRA to form the comprehensive VLN-PETL framework. Finally, we present a March-in-Chat (MiC) model, which enables the REVERIE agent to converse with an LLM for proactive planning of future steps. This model contains three modules: a Goal-Oriented Static Planning (GOSiP) module, a Scene-Oriented Dynamic Planning (SODiP) module, and a Room-and-Object Aware Scene Perceiver (ROASeP) module. Through extensive quantitative and qualitative experiments, we demonstrate the efficiency and potential of our contributions to advancing the field of VLN. | en
dc.language.iso | en | en
dc.subject | vision and language | en
dc.subject | vision-and-language navigation | en
dc.subject | deep learning | en
dc.title | General Vision and Language Methods in Real Applications: A Focus on Vision-and-Language Navigation | en
dc.type | Thesis | en
dc.contributor.school | School of Computer and Mathematical Sciences | en
dc.provenance | This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals | en
dc.description.dissertation | Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2023 | en
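Note on the PETL methods named in the abstract: the thesis integrates its HIB and CIB modules with the standard Adapter and LoRA techniques. The thesis's own VLN-PETL code is not part of this record; below is only a minimal, generic PyTorch sketch of those two standard building blocks, with all names, dimensions, and the `bottleneck_dim` / `rank` / `alpha` parameters chosen here for illustration.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (Houlsby et al. style): down-project,
    non-linearity, up-project, residual connection."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))

class LoRALinear(nn.Module):
    """Generic LoRA layer (Hu et al. style): freeze the original linear
    weight and learn a low-rank update B @ A, scaled by alpha / rank."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

In parameter-efficient transfer learning, only modules like these (a small fraction of the model's parameters) are updated during fine-tuning, while the pre-trained backbone stays frozen; the thesis's HIB and CIB modules extend this idea with history- and cross-modality-specific components not shown here.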
Appears in Collections:Research Theses

Files in This Item:
File | Description | Size | Format
Qiao2023_PhD.pdf | | 7.47 MB | Adobe PDF

