Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/135615
Type: Thesis
Title: Interactive Vision and Language Learning
Author: Parvaneh, Amin
Issue Date: 2022
School/Discipline: School of Computer Science
Abstract: Effective and efficient interaction with humans in real environments is an appealing yet challenging goal for an artificial agent. Despite recent advances in deep learning, especially in vision and language learning, several problems remain unsolved on the way to such an agent. Three critical aspects of human-machine interaction via natural language (e.g. to create intelligent assistants) are: (1) understanding and anticipating human intents so that the model can participate consistently in conversations, (2) learning from a small set of instances and actively seeking the information the model needs to achieve its goals accurately, and (3) generalising from that small number of observations obtained under human supervision so that the agent can be used in practice. Regarding human intent perception, we propose an inclusive model for the visual negotiation task, in which the intelligent agent must anticipate human intent while communicating via natural language. Our model exploits online resources, searching for similar items to estimate a fair agreement price that humans might set as their goal. Using the estimated agreement price of the advertised item together with its visual and textual features (i.e. images and textual descriptions), we build competitive and consistent language and price generation policies that negotiate significantly better than other baselines. For the information-seeking aspect, we propose an effective active learning (AL) method that facilitates learning with less labelled data by selecting a small subset of unlabelled instances that, when labelled and used for training, yield the largest gain in test accuracy. We propose efficient interpolations in feature space between unlabelled and labelled samples to identify unlabelled instances with inconsistent class predictions in their neighbourhood. After requesting labels for the selected subset from a human expert, retraining on them gives a larger performance boost than other AL methods; in particular, our method achieves remarkable results in low-data regimes on high-dimensional data, where the performance of other AL methods is unsatisfactory. Finally, regarding generalisation, we equip the agent with the ability to reason about counterfactual scenarios, which discourages the model from focusing on spurious features or memorising seen environments. To that end, we let the model intervene on the visual and textual features of the input in a causal model and create counterfactual samples that, together with the real observations, are used to train the model. The trained model is therefore more resilient to spurious features and biases in the data and generalises better to unseen situations. Additionally, to improve generalisation to unseen environments in more interactive applications, we propose a novel approach that generates counterfactual environments and requires the agent to learn from both the observations and the actions in those environments. After formalising supervised and reinforcement learning objectives that include both real and counterfactual environments, our trained agent generalises significantly better than other baselines to unseen environments in two challenging vision-and-language navigation tasks.
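The feature-space interpolation idea in the active learning part of the abstract can be illustrated with a minimal sketch. The code below is not the thesis implementation: it assumes a trained PyTorch classifier head clf that maps feature vectors to class logits, labelled anchor features z_labelled (for instance, per-class feature means) and unlabelled pool features z_unlabelled; the names, the fixed mixing coefficient alpha, and the anchor choice are all illustrative assumptions. An unlabelled point is flagged when interpolating its features towards a labelled anchor flips the predicted class, i.e. its neighbourhood predictions are inconsistent.

    import torch

    def select_inconsistent(clf, z_labelled, z_unlabelled, alpha=0.2, budget=100):
        """Sketch of feature-mixing selection for active learning: flag unlabelled
        points whose class prediction changes when their features are interpolated
        towards labelled anchors, then query up to `budget` of them."""
        with torch.no_grad():
            # Predictions on the raw unlabelled features.
            base_pred = clf(z_unlabelled).argmax(dim=1)            # shape (U,)

            inconsistent = torch.zeros(len(z_unlabelled), dtype=torch.bool)
            for anchor in z_labelled:                              # e.g. one anchor per class
                # Interpolate every unlabelled feature towards this labelled anchor.
                z_mix = alpha * anchor.unsqueeze(0) + (1 - alpha) * z_unlabelled
                mix_pred = clf(z_mix).argmax(dim=1)
                # A flipped prediction marks an inconsistent neighbourhood.
                inconsistent |= mix_pred != base_pred

        candidates = inconsistent.nonzero(as_tuple=True)[0]
        # Request labels for at most `budget` of the flagged instances.
        return candidates[:budget]

In this sketch the selection signal is a prediction flip under interpolation rather than, say, softmax uncertainty on the raw features, which matches the abstract's emphasis on inconsistency in the neighbourhood of an unlabelled point; how candidates are ranked within the budget is left open here.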
Advisor: Shi, Javen Qinfeng
Abbasnejad, Ehsan
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2022
Keywords: Deep Learning
Vision and Language Learning
Active Learning
Vision and Language Navigation
Counterfactual Learning
Visual Negotiation
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections: Research Theses

Files in This Item:
File                  Description  Size      Format
Parvaneh2022_PhD.pdf               12.14 MB  Adobe PDF