Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

dc.contributor.author Sáez Silvestre, Carlos es_ES
dc.contributor.author Romero, Nekane es_ES
dc.contributor.author Conejero, J. Alberto es_ES
dc.contributor.author Garcia-Gomez, Juan M es_ES
dc.date.accessioned 2022-11-14T19:02:07Z
dc.date.available 2022-11-14T19:02:07Z
dc.date.issued 2021-02 es_ES
dc.identifier.issn 1067-5027 es_ES
dc.identifier.uri http://hdl.handle.net/10251/189724
dc.description.abstract [EN] Objective: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. Materials and Methods: We used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities. Results: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model's target populations and increase model complexity, raising the risk of overfitting. Conclusions: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning. es_ES
dc.description.sponsorship This work was supported by Universitat Politecnica de Valencia contract no. UPV-SUB.2-1302 and FONDO SUPERA COVID-19 by CRUE-Santander Bank grant "Severity Subgroup Discovery and Classification on COVID-19 Real World Data through Machine Learning and Data Quality assessment (SUBCOVERWD-19)." es_ES
dc.language English es_ES
dc.publisher Oxford University Press es_ES
dc.relation.ispartof Journal of the American Medical Informatics Association es_ES
dc.rights All rights reserved es_ES
dc.subject COVID-19 es_ES
dc.subject Data quality es_ES
dc.subject Machine learning es_ES
dc.subject Biases es_ES
dc.subject Data sharing es_ES
dc.subject Distributed research networks es_ES
dc.subject Multi-site data es_ES
dc.subject Variability es_ES
dc.subject Heterogeneity es_ES
dc.subject Dataset shift es_ES
dc.subject.classification APPLIED MATHEMATICS es_ES
dc.subject.classification APPLIED PHYSICS es_ES
dc.title Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset es_ES
dc.type Article es_ES
dc.identifier.doi 10.1093/jamia/ocaa258 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/UPV//UPV-SUB.2-1302/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escuela Técnica Superior de Ingenieros Industriales - Escola Tècnica Superior d'Enginyers Industrials es_ES
dc.description.bibliographicCitation Sáez Silvestre, C.; Romero, N.; Conejero, JA.; Garcia-Gomez, JM. (2021). Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset. Journal of the American Medical Informatics Association. 28(2):360-364. https://doi.org/10.1093/jamia/ocaa258 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1093/jamia/ocaa258 es_ES
dc.description.upvformatpinicio 360 es_ES
dc.description.upvformatpfin 364 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 28 es_ES
dc.description.issue 2 es_ES
dc.identifier.pmid 33027509 es_ES
dc.identifier.pmcid PMC7797735 es_ES
dc.relation.pasarela S\435767 es_ES
dc.contributor.funder BANCO SANTANDER, S.A. es_ES
dc.contributor.funder Universitat Politècnica de València es_ES
dc.subject.ods 03.- Ensure healthy lives and promote well-being for all at all ages es_ES
dc.subject.ods 10.- Reduce inequality within and among countries es_ES

