Exploring Hybrid Parallel Systems for Probabilistic Record Linkage

Boratto, Murilo; Alonso-Jordá, Pedro; Pinto, Clicia; Melo, Pedro; Barreto, Marcos; Denaxas, Spiros

doi:10.1007/s11227-018-2328-3

Files in this item

Name: Boratto;Alonso-Jo ...

Size: 603.6Kb

Format: PDF

Description: Author's version.

Name: s11227-018-2328-3.pdf

Size: 871.8Kb

Format: PDF

Description: Publisher's version.

dc.contributor.author	Boratto, Murilo	es_ES
dc.contributor.author	Alonso-Jordá, Pedro	es_ES
dc.contributor.author	Pinto, Clicia	es_ES
dc.contributor.author	Melo, Pedro	es_ES
dc.contributor.author	Barreto, Marcos	es_ES
dc.contributor.author	Denaxas, Spiros	es_ES
dc.date.accessioned	2020-07-15T03:32:17Z
dc.date.available	2020-07-15T03:32:17Z
dc.date.issued	2019-03	es_ES
dc.identifier.issn	0920-8542	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/148002
dc.description.abstract	[EN] Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on the existence of common key attributes among all data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.	es_ES
dc.description.sponsorship	This work has been partially supported by CNPq, FAPESB, Bill & Melinda Gates Foundation, The Royal Society (UK), Medical Research Council (UK), NVIDIA Hardware Grant Program, Generalitat Valenciana (Grant PROMETEOII/2014/003), Spanish Government and European Commission through TEC2015-67387-C4-1-R (MINECO/FEDER), and the CAPAP-H network. We have also worked in cooperation with the EU-COST Programme Action IC1305, "Network for Sustainable Ultrascale Computing (NESUS)".	es_ES
dc.language	English	es_ES
dc.publisher	Springer-Verlag	es_ES
dc.relation.ispartof	The Journal of Supercomputing	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Probabilistic linkage	es_ES
dc.subject	Public health	es_ES
dc.subject	Performance evaluation	es_ES
dc.subject	Multicore	es_ES
dc.subject	Multi-GPU	es_ES
dc.subject.classification	COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE	es_ES
dc.title	Exploring Hybrid Parallel Systems for Probabilistic Record Linkage	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1007/s11227-018-2328-3	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/COST//IC1305/EU/Network for Sustainable Ultrascale Computing (NESUS)/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2014%2F003/ES/Computación y comunicaciones de altas prestaciones y aplicaciones en ingeniería/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO//TEC2015-67387-C4-1-R/ES/SMART SOUND PROCESSING FOR THE DIGITAL LIVING/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Boratto, M.; Alonso-Jordá, P.; Pinto, C.; Melo, P.; Barreto, M.; Denaxas, S. (2019). Exploring Hybrid Parallel Systems for Probabilistic Record Linkage. The Journal of Supercomputing. 75:1137-1149. https://doi.org/10.1007/s11227-018-2328-3	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	https://doi.org/10.1007/s11227-018-2328-3	es_ES
dc.description.upvformatpinicio	1137	es_ES
dc.description.upvformatpfin	1149	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	75	es_ES
dc.relation.pasarela	S\382115	es_ES
dc.contributor.funder	Royal Society, United Kingdom	es_ES
dc.contributor.funder	Generalitat Valenciana	es_ES
dc.contributor.funder	Bill and Melinda Gates Foundation	es_ES
dc.contributor.funder	Medical Research Council, United Kingdom	es_ES
dc.contributor.funder	European Cooperation in Science and Technology	es_ES
dc.contributor.funder	Fundação de Amparo à Pesquisa do Estado da Bahia	es_ES
dc.contributor.funder	Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil	es_ES
dc.contributor.funder	Ministerio de Economía y Competitividad	es_ES

This item appears in the following collection(s)

Articles, conferences, monographs [48344]
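
The abstract notes that probabilistic linkage is expensive because of the number of records that must be compared. The following is a minimal sequential sketch of that cost, not the authors' implementation: it assumes a Fellegi-Sunter-style agreement weight, which the abstract does not specify, and the field names, the m/u probabilities, the threshold, and the sample records are all hypothetical. It makes no attempt to model the multicore or multi-GPU execution the paper evaluates.

```python
import math
from itertools import product

# Hypothetical per-field probabilities (illustrative values, not from the paper):
# m = P(field agrees | records match), u = P(field agrees | records do not match).
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "birth_date": (0.98, 0.05),
    "city": (0.90, 0.20),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def link(source_a, source_b, threshold):
    """Score every pair (len(source_a) * len(source_b) comparisons)."""
    links = []
    for (i, a), (j, b) in product(enumerate(source_a), enumerate(source_b)):
        weight = match_weight(a, b)
        if weight >= threshold:
            links.append((i, j, weight))
    return links


source_a = [
    {"name": "maria silva", "birth_date": "1980-02-01", "city": "salvador"},
    {"name": "joao souza", "birth_date": "1975-11-23", "city": "recife"},
]
source_b = [
    {"name": "maria silva", "birth_date": "1980-02-01", "city": "feira de santana"},
    {"name": "ana lima", "birth_date": "1990-06-30", "city": "recife"},
]

# Hypothetical threshold; only the first record of each source is linked here.
links = link(source_a, source_b, threshold=5.0)
```

Because every record in one source is scored against every record in the other, the work grows with the product of the two input sizes. Each pair is scored independently of the others, which is what makes this kind of workload a candidate for parallel execution.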

RiuNet: Institutional Repository of the Universitat Politècnica de València