Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant

Wiese, I.; Da Silva, J.; Steinmacher, I.; Treude, C.; Gerosa, M.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/107036

Full metadata record

DC Field	Value	Language
dc.contributor.author	Wiese, I.	-
dc.contributor.author	Da Silva, J.	-
dc.contributor.author	Steinmacher, I.	-
dc.contributor.author	Treude, C.	-
dc.contributor.author	Gerosa, M.	-
dc.date.issued	2017	-
dc.identifier.citation	Proceedings of the 32nd IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 345-355	-
dc.identifier.isbn	9781509038060	-
dc.identifier.issn	1063-6773	-
dc.identifier.uri	http://hdl.handle.net/2440/107036	-
dc.description.abstract	Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses to interact in a single mailing list. This can affect results and tools, for example, when extracting social networks. The issue is particularly relevant for popular and long-term Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses belonging to the same participant; however, few studies analyze the effectiveness of these heuristics. In addition, many studies still do not use any heuristic for author disambiguation, which can compromise their results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists from Apache Software Foundation projects. We found that the heuristics proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases when considering the F-measure metric. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers choose the most appropriate heuristic, and they highlight the need to deal with identity disambiguation, mainly in open source software communities with a large number of participants.	-
dc.description.statementofresponsibility	Igor Scaliante Wiese, José Teodoro da Silva, Igor Steinmacher, Christoph Treude, Marco Aurélio Gerosa	-
dc.language.iso	en	-
dc.publisher	IEEE	-
dc.relation.ispartofseries	Proceedings-IEEE International Conference on Software Maintenance	-
dc.rights	© 2016 IEEE	-
dc.source.uri	http://dx.doi.org/10.1109/icsme.2016.13	-
dc.subject	Email address disambiguation; mailing lists; Apache Software Foundation; mining software repositories	-
dc.title	Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant	-
dc.type	Conference paper	-
dc.contributor.conference	32nd IEEE International Conference on Software Maintenance and Evolution (ICSME) (2 Oct 2016 - 7 Oct 2016 : Raleigh, North Carolina)	-
dc.identifier.doi	10.1109/ICSME.2016.13	-
dc.publisher.place	online	-
pubs.publication-status	Published	-
dc.identifier.orcid	Treude, C. [0000-0002-6919-2149]	-
Appears in Collections:	Aurora harvest 3; Computer Science publications

Adelaide Research & Scholarship