Sample Size vs. Bias in Defect Prediction

Rahman, Foyzur; Posnett, Daryl; Herraiz Tabernero, Israel; Devanbu, Premkumar (2013). Sample Size vs. Bias in Defect Prediction. In: Proceedings of the 9th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2013), August 2013. ISBN 978-1-4503-2237-9.

Description

Title: Sample Size vs. Bias in Defect Prediction
Author/s:
  • Rahman, Foyzur
  • Posnett, Daryl
  • Herraiz Tabernero, Israel
  • Devanbu, Premkumar
Item Type: Conference or Workshop Presentation (Article)
Event Title: 9th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering
Event Dates: August 2013
Title of Book: Proceedings of the 9th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering
Date: August 2013
ISBN: 978-1-4503-2237-9
Subjects:
Faculty: E.T.S.I. Caminos, Canales y Puertos (UPM)
Department: Matemática e Informática Aplicadas a la Ingeniería Civil [until 2014]
Creative Commons Licenses: Attribution

Full text

PDF (updated version, July 2013), 347 kB.

Abstract

Most empirical disciplines promote the reuse and sharing of datasets, as this leads to greater possibility of replication. While this is increasingly the case in Empirical Software Engineering (ESE), some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of studies based on biased datasets may be suspect. This issue has caused considerable consternation in the ESE literature in recent years. However, there is a confounding factor in these datasets that has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples can be smaller or larger. Smaller datasets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask: which affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much as bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue and raises further issues to be explored in the future.
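As a rough illustration of the experimental setup the abstract describes, the sketch below (Python, using numpy and scikit-learn) subsamples a synthetic stand-in for a high-quality, relatively bias-free defect dataset at several sizes and bias levels, trains a simple classifier on each sample, and compares AUC on a held-out set. The synthetic data, the biased_sample helper, and the logistic-regression model are illustrative assumptions, not the authors' actual pipeline.

# A minimal sketch of a size-vs-bias simulation; not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a high-quality defect dataset:
# two code metrics and a defect label correlated with them.
n = 20000
X = rng.normal(size=(n, 2))
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=n) > 0).astype(int)

X_test, y_test = X[:5000], y[:5000]    # held-out evaluation set
X_pool, y_pool = X[5000:], y[5000:]    # pool to draw training samples from

def biased_sample(size, bias):
    # Draw a training sample of `size`, under-sampling defective
    # instances with probability `bias` (bias=0.0 means unbiased).
    keep_prob = np.where(y_pool == 1, 1.0 - bias, 1.0)
    weights = keep_prob / keep_prob.sum()
    idx = rng.choice(len(y_pool), size=size, replace=False, p=weights)
    return X_pool[idx], y_pool[idx]

for size in (200, 1000, 5000):
    for bias in (0.0, 0.5, 0.8):
        Xs, ys = biased_sample(size, bias)
        model = LogisticRegression().fit(Xs, ys)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"size={size:5d} bias={bias:.1f} AUC={auc:.3f}")

Varying `size` and `bias` in the loops shows how the two factors trade off against each other on a held-out set, which is the question the paper's meta-analysis quantifies at scale on real defect data.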

More information

Item ID: 15712
DC Identifier: http://oa.upm.es/15712/
OAI Identifier: oai:oa.upm.es:15712
Official URL: http://esec-fse.inf.ethz.ch/
Deposited by: Israel Herraiz
Deposited on: 09 Jun 2013 13:54
Last Modified: 21 Apr 2016 15:59