Leveraging state-of-the-art engines for large-scale data analysis in High Energy Physics

Padulano, Vincenzo Eduardo; Kabadzhov, Ivan Donchev; Tejedor Saavedra, Enric; Guiraud, Enrico; Alonso-Jordá, Pedro

doi:10.1007/s10723-023-09645-2


Files in this item

Name: PadulanoKabadzhov ...

Size: 1.929Mb

Format: PDF

Description: Publisher's version

dc.contributor.author	Padulano, Vincenzo Eduardo	es_ES
dc.contributor.author	Kabadzhov, Ivan Donchev	es_ES
dc.contributor.author	Tejedor Saavedra, Enric	es_ES
dc.contributor.author	Guiraud, Enrico	es_ES
dc.contributor.author	Alonso-Jordá, Pedro	es_ES
dc.date.accessioned	2023-03-01T19:02:10Z
dc.date.available	2023-03-01T19:02:10Z
dc.date.issued	2023-02-10	es_ES
dc.identifier.issn	1570-7873	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/192206
dc.description.abstract	[EN] The Large Hadron Collider (LHC) at CERN has generated a vast amount of information from physics events, reaching peaks of TB of data per day which are then sent to large storage facilities. Traditionally, data processing workflows in the High Energy Physics (HEP) field have leveraged grid computing resources. In this context, users have been responsible for manually parallelising the analysis, sending tasks to computing nodes and aggregating the partial results. Analysis environments in this field have had a common building block in the ROOT software framework. This is the de facto standard tool for storing, processing and visualising HEP data. ROOT offers a modern analysis tool called RDataFrame, which can parallelise computations from a single machine to a distributed cluster while hiding most of the scheduling and result aggregation complexity from users. This is currently done by leveraging Apache Spark as the distributed execution engine, but other alternatives are being explored by HEP research groups. Notably, Dask has rapidly gained popularity thanks to its ability to interface with batch queuing systems, widespread in HEP grid computing facilities. Furthermore, future upgrades of the LHC are expected to bring a dramatic increase in data volumes. This paper presents a novel implementation of the Dask backend for the distributed RDataFrame tool in order to address the aforementioned future trends. The scalability of the tool with both the new backend and the already available Spark backend is demonstrated for the first time on more than two thousand cores, testing a real HEP analysis.	es_ES
dc.description.sponsorship	Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work benefited from the support of grant PID2020-113656RB-C22 funded by Ministerio de Ciencia e Innovación (Spain) MCIN/AEI/10.13039/501100011033.	es_ES
dc.language	English	es_ES
dc.publisher	Springer-Verlag	es_ES
dc.relation.ispartof	Journal of Grid Computing	es_ES
dc.rights	Attribution (by)	es_ES
dc.subject	ROOT	es_ES
dc.subject	High energy physics	es_ES
dc.subject	Distributed computing	es_ES
dc.subject	Dask	es_ES
dc.subject	Spark	es_ES
dc.subject.classification	COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE	es_ES
dc.title	Leveraging state-of-the-art engines for large-scale data analysis in High Energy Physics	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1007/s10723-023-09645-2	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113656RB-C22/ES/COMPUTACION Y COMUNICACIONES DE ALTAS PRESTACIONES CONSCIENTES DEL CONSUMO ENERGETICO. APLICACIONES AL APRENDIZAJE PROFUNDO COMPUTACIONAL - UPV/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica	es_ES
dc.description.bibliographicCitation	Padulano, VE.; Kabadzhov, ID.; Tejedor Saavedra, E.; Guiraud, E.; Alonso-Jordá, P. (2023). Leveraging state-of-the-art engines for large-scale data analysis in High Energy Physics. Journal of Grid Computing. 21:1-21. https://doi.org/10.1007/s10723-023-09645-2	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	https://doi.org/10.1007/s10723-023-09645-2	es_ES
dc.description.upvformatpinicio	1	es_ES
dc.description.upvformatpfin	21	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	21	es_ES
dc.relation.pasarela	S\482341	es_ES
dc.contributor.funder	Agencia Estatal de Investigación	es_ES
dc.contributor.funder	Universitat Politècnica de València
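
The abstract describes a workflow in which a dataset is split, tasks are sent to computing nodes and the partial results are aggregated. The following is a minimal Python sketch of that split, process and merge pattern for a histogram. It is an illustration only and does not use the ROOT, Dask or Spark APIs: the function names (split_ranges, fill_partial_histogram, merge_histograms, distributed_histogram) are hypothetical, and a local thread pool stands in for the cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_ranges(n_entries, n_partitions):
    """Split [0, n_entries) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(n_entries, n_partitions)
    ranges, start = [], 0
    for i in range(n_partitions):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # skip empty ranges when there are more partitions than entries
            ranges.append((start, end))
        start = end
    return ranges


def fill_partial_histogram(values, entry_range, n_bins, lo, hi):
    """Map step: histogram only the entries that belong to one range."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    start, end = entry_range
    for x in values[start:end]:
        if lo <= x < hi:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: bin-wise sum of two partial histograms."""
    return [x + y for x, y in zip(a, b)]


def distributed_histogram(values, n_partitions, n_bins, lo, hi):
    """Run one task per range and merge the partial results.

    Assumption: a thread pool plays the role of the distributed engine.
    """
    ranges = split_ranges(len(values), n_partitions)
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(
                lambda r: fill_partial_histogram(values, r, n_bins, lo, hi),
                ranges,
            )
        )
    return reduce(merge_histograms, partials, [0] * n_bins)
```

Because the merge step is an associative bin-wise sum, the result does not depend on how many partitions are used, which is the property that lets an engine schedule the tasks freely.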


RiuNet: Institutional Repository of the Universidad Politécnica de Valencia
