Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/137237
Citations
Scopus Web of Science® Altmetric
?
?
Type: Journal article
Title: Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT
Author: Elangovan, A.
Li, Y.
Pires, D.E.V.
Davis, M.J.
Verspoor, K.
Citation: BMC Bioinformatics, 2022; 23(1)
Publisher: Springer Science and Business Media LLC
Issue Date: 2022
ISSN: 1471-2105
1471-2105
Statement of
Responsibility: 
Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoor
Abstract: Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature by using distantly supervised training data using deep learning to aid human curation. Method We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models-dubbed PPI-BioBERT-x10-to improve confidence calibration. We extend the use of ensemble average confidence approach with confidence variation to counteract the effects of class imbalance to extract high confidence predictions. Results and conclusion The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro 41.3 (P =5 8.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filter [Formula: see text] (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration and highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
Keywords: Protein-protein interaction; Post-translational modifications; BioBERT; Natural language processing; Deep learning; Distant supervision
Rights: © The Author(s), 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
DOI: 10.1186/s12859-021-04504-x
Grant ID: http://purl.org/au-research/grants/arc/DP190101350
http://purl.org/au-research/grants/arc/LP160101469
Published version: http://dx.doi.org/10.1186/s12859-021-04504-x
Appears in Collections:Medical Sciences publications

Files in This Item:
File Description SizeFormat 
hdl_137237.pdfPublished version3.04 MBAdobe PDFView/Open


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.