Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/133600
Type: Thesis
Title: General and Fine-Grained Video Understanding using Machine Learning & Standardised Neural Network Architectures
Author: Faulkner, Hayden James
Issue Date: 2021
School/Discipline: School of Computer Science
Abstract: Recently, the broad adoption of the internet coupled with connected smart devices has seen a significant increase in the production, collection, and sharing of data. One of the biggest technical challenges in this new information age is how to effectively use machines to process and extract useful information from this data. Interpreting video data is of particular importance for many applications, including surveillance, cataloguing, and robotics; however, it is also particularly difficult due to video's natural sparseness - for large volumes of data there is only a small amount of useful information. This thesis examines and extends a number of Machine Learning models across several video understanding problem domains, including captioning, detection, and classification. Captioning videos with human-like sentences can be considered a good indication of how well a machine can interpret and distill the contents of a video. Captioning generally requires knowledge of the scene, objects, actions, relationships, and temporal dynamics. Current approaches break this problem into three stages, with most works focusing on visual feature filtering techniques to support a caption generation module. Current approaches, however, still struggle to associate ideas described in captions with their visual components in the video. We find that captioning models tend to generate shorter, more succinct captions, with overfitted training models performing significantly better than human annotators on the current evaluation metrics. After taking a closer look at the model- and human-generated captions, we highlight that the main challenge for captioning models is to correctly identify and generate specific nouns and verbs, particularly rare concepts. With this in mind, we experimentally analyse a handful of different concept grounding techniques, showing some to be promising in increasing captioning performance, particularly when concepts are identified correctly by the grounding mechanism.
To strengthen visual interpretations, recent captioning approaches utilise object detections to obtain more salient and detailed visual information. Currently, these detections are generated by an image-based detector processing only a single video frame; however, it is desirable to capture the temporal dynamics of objects across an entire video. We take an efficient image object detection framework and carry out an extensive exploration of the effects of a number of network modifications on the model's ability to perform on video data. We find a number of promising directions that improve upon the single-frame baseline. Furthermore, to increase concept coverage for object detection in video, we combine datasets from both the image and video domains. We then perform an in-depth analysis of the coverage of the combined detection dataset with respect to the concepts found in captions from video captioning datasets. While the bulk of this thesis centres around general video understanding - random videos from the internet - it is also useful to determine the performance of these Machine Learning techniques on a more fine-grained problem. We therefore introduce a new Tennis dataset, which includes broadcast video for five tennis matches with detailed annotations for match events and commentary-style captions. We evaluate a number of modern Machine Learning techniques for performing shot classification, both as a stand-alone task and as a precursor to commentary generation, finding that current models are similarly effective for this fine-grained problem.
Advisor: Dick, Anthony
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2021
Keywords: Machine learning
deep learning
video understanding
neural networks
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections: Research Theses

Files in This Item:
Faulkner2021_PhD.pdf (9.05 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.