Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/140132
Type: Thesis
Title: General Vision and Language Methods in Real Applications: A Focus on Vision-and-Language Navigation
Author: Qiao, Yanyuan
Issue Date: 2023
School/Discipline: School of Computer and Mathematical Sciences
Abstract: The field of Vision and Language has attracted significant interest and holds tremendous potential for real applications, particularly Vision-and-Language Navigation (VLN). The VLN task enables robots to understand navigation instructions expressed in natural language, perceive the environment, and execute corresponding actions, making it applicable to scenarios such as home assistants. Despite considerable progress in VLN, several challenges persist and warrant further attention: the lack of pre-training models that emphasize the temporal information specific to VLN, the need for parameter-efficient transfer learning techniques to make effective use of pre-trained models, and the exploration of Large Language Models (LLMs) to leverage their extensive knowledge for better VLN performance. In this thesis, we propose a series of new methods to address these challenges.

First, we introduce a history-enhanced and order-aware pre-training and fine-tuning paradigm for VLN. We design three VLN-specific proxy tasks: Action Prediction with History (APH), Trajectory Order Modeling (TOM), and Group Order Modeling (GOM). We further develop a memory network to address the inconsistent representation of history context between the pre-training and fine-tuning stages.

Second, we present the first study of Parameter-Efficient Transfer Learning (PETL) methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules, the Historical Interaction Booster (HIB) and the Cross-modal Interaction Booster (CIB), and integrate them with existing PETL methods such as Adapter and LoRA to form the comprehensive VLN-PETL framework.

Finally, we present the March-in-Chat (MiC) model, which enables conversations between the REVERIE agent and an LLM for proactive planning of future steps. The model contains three modules: a Goal-Oriented Static Planning (GOSiP) module, a Scene-Oriented Dynamic Planning (SODiP) module, and a Room-and-Object Aware Scene Perceiver (ROASeP) module. Through extensive quantitative and qualitative experiments, we demonstrate the efficiency and potential of our contributions to advancing the field of VLN.
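For readers unfamiliar with the PETL techniques the abstract builds on, the sketch below illustrates the generic bottleneck-adapter idea that methods such as VLN-PETL extend: the pre-trained backbone is frozen and only small residual modules are trained. This is a minimal PyTorch illustration under stated assumptions, not the thesis's actual HIB or CIB modules; the class names (BottleneckAdapter, AdaptedEncoderLayer) and dimensions are hypothetical choices for demonstration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: a small residual MLP whose parameters
    are the only ones updated during fine-tuning (illustrative sketch,
    not the thesis's VLN-PETL modules)."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapted layer starts as an
        # identity map and does not disturb the frozen backbone.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedEncoderLayer(nn.Module):
    """Wraps a frozen transformer encoder layer with a trainable adapter."""
    def __init__(self, frozen_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.adapter = BottleneckAdapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


if __name__ == "__main__":
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    adapted = AdaptedEncoderLayer(layer, hidden_dim=768)
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    total = sum(p.numel() for p in adapted.parameters())
    print(f"trainable: {trainable:,} / total: {total:,}")  # roughly 1-2%
```

LoRA follows the same freeze-the-backbone principle but injects trainable low-rank matrices into the attention weights instead of residual MLPs; the thesis's HIB and CIB modules combine such mechanisms with VLN-specific history and cross-modal interactions.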
Advisor: Wu, Qi
Qi, Yuankai
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2023
Keywords: vision and language
vision-and-language navigation
deep learning
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections:Research Theses

Files in This Item:
File: Qiao2023_PhD.pdf (7.47 MB, Adobe PDF)

