Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/140132
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Wu, Qi | -
dc.contributor.advisor | Qi, Yuankai | -
dc.contributor.author | Qiao, Yanyuan | -
dc.date.issued | 2023 | -
dc.identifier.uri | https://hdl.handle.net/2440/140132 | -
dc.description.abstract | The field of Vision and Language has attracted significant interest and holds tremendous potential for real-world applications, particularly in the area of Vision-and-Language Navigation (VLN). The VLN task enables robots to understand navigation instructions expressed in natural language, perceive the environment, and execute corresponding actions, making it applicable in various scenarios such as home assistants. Despite considerable progress in advancing the development of VLN, several challenges persist and warrant further attention. These challenges include the lack of pre-training models that emphasize temporal information specific to VLN, the necessity for parameter-efficient transfer learning techniques to effectively utilize pre-trained models, and the exploration of Large Language Models (LLMs) to leverage their extensive knowledge for enhanced performance in VLN. In this thesis, we propose a series of new methods to address these challenges. First, we introduce a history-enhanced and order-aware pre-training and fine-tuning paradigm for VLN. We design three VLN-specific proxy tasks: the Action Prediction with History (APH) task, the Trajectory Order Modeling (TOM) task, and the Group Order Modeling (GOM) task. Furthermore, we develop a memory network to address the representation inconsistency of history context between the pre-training and fine-tuning stages. Second, we present the first study exploring Parameter-Efficient Transfer Learning (PETL) methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: the Historical Interaction Booster (HIB) and the Cross-modal Interaction Booster (CIB), which are integrated with existing PETL methods such as Adapter and LoRA to form the comprehensive VLN-PETL framework. Finally, we present a March-in-Chat (MiC) model, which enables the REVERIE agent to converse with an LLM for proactive planning of future steps. This model contains three modules: a Goal-Oriented Static Planning (GOSiP) module, a Scene-Oriented Dynamic Planning (SODiP) module, and a Room-and-Object Aware Scene Perceiver (ROASeP) module. Through extensive quantitative and qualitative experiments, we demonstrate the efficiency and potential of our contributions to advancing the field of VLN. | en
dc.language.iso | en | en
dc.subject | vision and language | en
dc.subject | vision-and-language navigation | en
dc.subject | deep learning | en
dc.title | General Vision and Language Methods in Real Applications: A Focus on Vision-and-Language Navigation | en
dc.type | Thesis | en
dc.contributor.school | School of Computer and Mathematical Sciences | en
dc.provenance | This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals | en
dc.description.dissertation | Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2023 | en
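Note on the PETL methods named in the abstract: the thesis integrates its HIB and CIB modules with the standard Adapter and LoRA techniques. The thesis's own VLN-PETL code is not part of this record; below is only a minimal, generic PyTorch sketch of those two standard building blocks, with all names, dimensions, and the `bottleneck_dim` / `rank` / `alpha` parameters chosen here for illustration.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (Houlsby et al. style): down-project,
    non-linearity, up-project, residual connection."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))

class LoRALinear(nn.Module):
    """Generic LoRA layer (Hu et al. style): freeze the original linear
    weight and learn a low-rank update B @ A, scaled by alpha / rank."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

In parameter-efficient transfer learning, only modules like these (a small fraction of the model's parameters) are updated during fine-tuning, while the pre-trained backbone stays frozen; the thesis's HIB and CIB modules extend this idea with history- and cross-modality-specific components not shown here.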
Appears in Collections:Research Theses

Files in This Item:
File | Description | Size | Format
Qiao2023_PhD.pdf | | 7.47 MB | Adobe PDF

