dc.contributor.author | Padulano, Vincenzo Eduardo | es_ES |
dc.contributor.author | Kabadzhov, Ivan Donchev | es_ES |
dc.contributor.author | Tejedor Saavedra, Enric | es_ES |
dc.contributor.author | Guiraud, Enrico | es_ES |
dc.contributor.author | Alonso-Jordá, Pedro | es_ES |
dc.date.accessioned | 2023-03-01T19:02:10Z | |
dc.date.available | 2023-03-01T19:02:10Z | |
dc.date.issued | 2023-02-10 | es_ES |
dc.identifier.issn | 1570-7873 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10251/192206 | |
dc.description.abstract | [EN] The Large Hadron Collider (LHC) at CERN has generated a vast amount of information from physics events, reaching peaks of TB of data per day which are then sent to large storage facilities. Traditionally, data processing workflows in the High Energy Physics (HEP) field have leveraged grid computing resources. In this context, users have been responsible for manually parallelising the analysis, sending tasks to computing nodes and aggregating the partial results. Analysis environments in this field have had a common building block in the ROOT software framework. This is the de facto standard tool for storing, processing and visualising HEP data. ROOT offers a modern analysis tool called RDataFrame, which can parallelise computations from a single machine to a distributed cluster while hiding most of the scheduling and result aggregation complexity from users. This is currently done by leveraging Apache Spark as the distributed execution engine, but other alternatives are being explored by HEP research groups. Notably, Dask has rapidly gained popularity thanks to its ability to interface with batch queuing systems, widespread in HEP grid computing facilities. Furthermore, future upgrades of the LHC are expected to bring a dramatic increase in data volumes. This paper presents a novel implementation of the Dask backend for the distributed RDataFrame tool in order to address the aforementioned future trends. The scalability of the tool with both the new backend and the already available Spark backend is demonstrated for the first time on more than two thousand cores, testing a real HEP analysis. | es_ES |
dc.description.sponsorship | Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work benefited from the support of grant PID2020-113656RB-C22 funded by Ministerio de Ciencia e Innovacion (Spain) MCIN/AEI/10.13039/501100011033. | es_ES |
dc.language | English | es_ES |
dc.publisher | Springer-Verlag | es_ES |
dc.relation.ispartof | Journal of Grid Computing | es_ES |
dc.rights | Attribution (by) | es_ES |
dc.subject | ROOT | es_ES |
dc.subject | High energy physics | es_ES |
dc.subject | Distributed computing | es_ES |
dc.subject | Dask | es_ES |
dc.subject | Spark | es_ES |
dc.subject.classification | COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE | es_ES |
dc.title | Leveraging state-of-the-art engines for large-scale data analysis in High Energy Physics | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.1007/s10723-023-09645-2 | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113656RB-C22/ES/COMPUTACION Y COMUNICACIONES DE ALTAS PRESTACIONES CONSCIENTES DEL CONSUMO ENERGETICO. APLICACIONES AL APRENDIZAJE PROFUNDO COMPUTACIONAL - UPV/ | es_ES |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica | es_ES |
dc.description.bibliographicCitation | Padulano, VE.; Kabadzhov, ID.; Tejedor Saavedra, E.; Guiraud, E.; Alonso-Jordá, P. (2023). Leveraging state-of-the-art engines for large-scale data analysis in High Energy Physics. Journal of Grid Computing. 21:1-21. https://doi.org/10.1007/s10723-023-09645-2 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | https://doi.org/10.1007/s10723-023-09645-2 | es_ES |
dc.description.upvformatpinicio | 1 | es_ES |
dc.description.upvformatpfin | 21 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 21 | es_ES |
dc.relation.pasarela | S\482341 | es_ES |
dc.contributor.funder | Agencia Estatal de Investigación | es_ES |
dc.contributor.funder | Universitat Politècnica de València | |
dc.description.references | Apollinari, G., Béjar Alonso, I., Brüning, O., Fessia, P., Lamont, M., Rossi, L., Tavian, L.: High-luminosity large hadron collider (HL-LHC): technical design report V. 0.1. Technical report CERN. https://doi.org/10.23731/CYRM-2017-004 (2017) | es_ES |
dc.description.references | Elsen, E.: A roadmap for HEP software and computing R&D for the 2020s. Comput Softw Big Sci, vol 16(3). https://doi.org/10.1007/s41781-019-0031-6 (2019) | es_ES |
dc.description.references | Brun, R., Rademakers, F.: ROOT — an object oriented data analysis framework. Nuclear Instr. Methods Phys. Res. Section A Accelerators, Spectrometers, Detectors Assoc. Equip. 389(1), 81–86 (1997). https://doi.org/10.1016/S0168-9002(97)00048-X. New computing techniques in physics research V | es_ES |
dc.description.references | Blomer, J., Canal, P., Naumann, A., Piparo, D.: Evolution of the ROOT tree I/O. EPJ Web Conf. 245, 02030 (2020). https://doi.org/10.1051/epjconf/202024502030 | es_ES |
dc.description.references | Lopez-Gomez, J., Blomer, J.: RNTuple performance: status and outlook. arXiv:2022.09043. https://doi.org/10.48550 | es_ES |
dc.description.references | Piparo, D., Canal, P., Guiraud, E., Valls Pla, X., Ganis, G., Amadio, G., Naumann, A., Tejedor Saavedra, E.: RDataFrame: easy parallel ROOT analysis at 100 threads. EPJ Web Conf. 214, 06029 (2019). https://doi.org/10.1051/epjconf/201921406029 | es_ES |
dc.description.references | Bird, I.: Computing for the large hadron collider. Annu. Rev. Nucl. Part. Sci. 61 (1), 99–118 (2011). https://doi.org/10.1146/annurev-nucl-102010-130059 | es_ES |
dc.description.references | Team, R., Brann, K.A., Amadio, G., An, S., Bellenot, B., Blomer, J., Canal, P., Couet, O., Galli, M., Guiraud, E., Hageboeck, S., Linev, S., Vila, P.M., Moneta, L., Naumann, A., Tadel, A.M., Padulano, V.E., Rademakers, F., Shadura, O., Tadel, M., Saavedra, E.T., Pla, X.V., Vassilev, V., Wunsch, S.: Software challenges for HL-LHC data analysis. arXiv:2004.07675. 10.48550 (2020) | es_ES |
dc.description.references | Tannenbaum, T., Wright, D., Miller, K., Livny, M.: Condor – a Distributed Job Scheduler. In: Sterling, T. (ed.) Beowulf Cluster Computing with Linux. MIT Press (2001) | es_ES |
dc.description.references | Jette, M., Dunlap, C., Garlick, J., Grondona, M.: Slurm: simple linux utility for resource management. Technical report, LLNL. https://www.osti.gov/biblio/15002962 (2002) | es_ES |
dc.description.references | Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10, p. 10. USENIX association. https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets (2010) | es_ES |
dc.description.references | Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Huff, K., Bergstra, J. (eds.) Proceedings of the 14th Python in Science Conference, pp. 130–136. SciPy (2015) | es_ES |
dc.description.references | Rilee, M., Griessbaum, N., Kuo, K.-S., Frew, J., Wolfe, R.: STARE-based integrative analysis of diverse data using dask parallel programming demo paper. In: Proceedings of the 28th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’20, pp. 417–420. Association for computing machinery. https://doi.org/10.1145/3397536.3422346 (2020) | es_ES |
dc.description.references | Gharat, J., Kumar, B., Ragha, L., Barve, A., Jeelani, S.M., Clyne, J.: Development of NCL equivalent serial and parallel python routines for meteorological data analysis. Int. J. High Performance Comput. Appl., https://doi.org/10.1177/10943420221077110 (2022) | es_ES |
dc.description.references | Hamman, J.J., Rocklin, M., Abernathy, R.M.: Pangeo: a big-data ecosystem for scalable earth system science. In: 20th EGU General Assembly, EGU2018, p. 12146. The SAO/NASA astrophysics data system (ADS) (2018) | es_ES |
dc.description.references | Fan, S., Linke, M., Paraskevakos, I., Gowers, R.J., Gecht, M., Beckstein, O.: PMDA - Parallel molecular dynamics analysis. In: Calloway, C., Lippa, D., Niederhut, D., Shupe, D. (eds.) Proceedings of the 18th Python in Science Conference, pp. 134–142. SciPy. https://doi.org/10.25080/Majora-7ddc1dd1-013 (2019) | es_ES |
dc.description.references | Dask: dask.dataframe documentation. https://docs.dask.org/en/stable/dataframe.html. Accessed 25 Nov 2022 (2022) | es_ES |
dc.description.references | Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 1, 145–164 (2016). https://doi.org/10.1007/s41060-016-0027-9 | es_ES |
dc.description.references | Khan, M.A., Karim, M.R., Kim, Y.: A two-stage big data analytics framework with real world applications using spark machine learning and long Short-Term memory network. Symmetry, vol. 10(10). https://doi.org/10.3390/sym10100485 (2018) | es_ES |
dc.description.references | Ramírez-Gallego, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Benítez, J.M., Alonso-Betanzos, A., Herrera, F.: An information theory-based feature selection framework for big data under apache spark. IEEE Trans. Syst. Man Cybern. Syst. 48(9), 1441–1453 (2018). https://doi.org/10.1109/TSMC.2017.2670926 | es_ES |
dc.description.references | Chaudhari, A.A., Mulay, P.: SCSI: real-time data analysis with cassandra and spark, pp. 237–264. Springer. https://doi.org/10.1007/978-981-13-0550-4_11 (2019) | es_ES |
dc.description.references | Shyam, R., Bharathi Ganesh, H.B., Sachin Kumar, S., Poornachandran, P., Soman, K.P.: Apache spark a big data analytics platform for smart grid. Proced. Technol. 21, 171–178 (2015). https://doi.org/10.1016/j.protcy.2015.10.085 | es_ES |
dc.description.references | Shin, H., Lee, K., Kwon, H.: A comparative experimental study of distributed storage engines for big spatial data processing using GeoSpark. J. Supercomput. 78, 2556–2579 (2022). https://doi.org/10.1007/s11227-021-03946-7 | es_ES |
dc.description.references | Graur, D., Müller, I., Proffitt, M., Fourny, G., Watts, G.T., Alonso, G.: Evaluating query languages and systems for high-energy physics data. Proc. VLDB Endow. 15(2), 154–168 (2021). https://doi.org/10.14778/3489496.3489498 | es_ES |
dc.description.references | Feichtinger, D., Canal, P., Reed, C., Loizides, C., Ballintijn, M., Rademakers, F., Peters, A.J., Kickinger, G., Iwaszkiewicz, J., Ganis, G., Brun, R., Bellenot, B.: PROOF - the parallel ROOT facility. In: 2006 15th IEEE International Conference on High Performance Distributed Computing, pp. 379–380. EDP sciences. https://doi.org/10.1109/HPDC.2006.1652193 (2006) | es_ES |
dc.description.references | Chatrchyan, S., et al.: The CMS experiment at the CERN LHC. JINST 3, 08004 (2008). https://doi.org/10.1088/1748-0221/3/08/S08004 | es_ES |
dc.description.references | Sehrish, S., Kowalkowski, J., Paterno, M.: Spark and HPC for high energy physics data analyses. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1048–1057. IEEE, Lake Buena Vista, FL, USA. https://doi.org/10.1109/IPDPSW.2017.112 (2017) | es_ES |
dc.description.references | Gutsche, O., Cremonesi, M., Elmer, P., Jayatilaka, B., Kowalkowski, J., Pivarski, J., Sehrish, S., Surez, C.M., Svyatkovskiy, A., Tran, N.: Big data in HEP: a comprehensive use case study. J. Phys. Conf. Ser. 898, 072012 (2017). https://doi.org/10.1088/1742-6596/898/7/072012 | es_ES |
dc.description.references | Gutsche, O., Canali, L., Cremer, I., Cremonesi, M., Elmer, P., Fisk, I., Girone, M., Jayatilaka, B., Kowalkowski, J., Khristenko, V., Motesnitsalis, E., Pivarski, J., Sehrish, S., Surdy, K., Svyatkovskiy, A.: CMS analysis and data reduction with apache spark. J. Phys. Conf. Ser. 1085, 042030 (2018). https://doi.org/10.1088/1742-6596/1085/4/042030 | es_ES |
dc.description.references | Avati, V., Blaszkiewicz, M., Bocchi, E., Canali, L., Castro, D., Cervantes, J., Grzanka, L., Guiraud, E., Kaspar, J., Kothuri, P., Lamanna, M., Malawski, M., Mnich, A., Moscicki, J., Murali, S., Piparo, D., Tejedor, E.: Declarative big data analysis for high-energy physics: TOTEM use case. In: Yahyapour, R. (ed.) Euro-par 2019: Parallel Processing, pp. 241–255. Springer (2019) | es_ES |
dc.description.references | Baranowski, Z., Kleszcz, E., Kothuri, P., Canali, L., Castellotti, R., Marquez, M.M., De Barros, N.G.M., Motesnitsalis, E., Mrowczynski, P., Duran, J.C.L.: Evolution of the hadoop platform and ecosystem for high energy physics. EPJ Web Conf. 214, 04058 (2019). https://doi.org/10.1051/epjconf/201921404058 | es_ES |
dc.description.references | Adamec, M., Attebury, G., Bloom, K., Bockelman, B., Lundstedt, C., Shadura, O., Thiltges, J.: Coffea-casa: an analysis facility prototype. EPJ Web Conf. 251, 02061 (2021). https://doi.org/10.1051/epjconf/202125102061 | es_ES |
dc.description.references | Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). https://doi.org/10.1145/1327452.1327492 | es_ES |
dc.description.references | Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC ’13. Association for computing machinery. https://doi.org/10.1145/2523616.2523633 (2013) | es_ES |
dc.description.references | Kubernetes: homepage. https://kubernetes.io/. Accessed 25 Nov 2022 (2022) | es_ES |
dc.description.references | NumPy: homepage. https://numpy.org/. Accessed 25 Nov 2022 (2022) | es_ES |
dc.description.references | Pandas: homepage. https://pandas.pydata.org/. Accessed 25 Nov 2022 (2022) | es_ES |
dc.description.references | Nitzberg, B., Schopf, J.M., Jones, J.P.: PBS pro: grid computing and scheduling attributes, pp. 183–190. Kluwer academic publishers, USA (2004) | es_ES |
dc.description.references | Hudak, P.: Conception, evolution, and application of functional programming languages. ACM Comput. Surv. 21(3), 359–411 (1989). https://doi.org/10.1145/72551.72554 | es_ES |
dc.description.references | Dozza, M., Bärgman, J., Lee, J.D.: Chunking: a procedure to improve naturalistic data analysis. Accident Anal. Prevention 58, 309–317 (2013). https://doi.org/10.1016/j.aap.2012.03.020 | es_ES |
dc.description.references | Rew, R.: Chunking data: why it matters. https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters (2013) | es_ES |
dc.description.references | Padulano, V.E., Villanueva, J.C., Guiraud, E., Saavedra, E.T.: Distributed data analysis with ROOT RDataFrame. EPJ Web Conf. 245, 03009 (2020). https://doi.org/10.1051/epjconf/202024503009 | es_ES |
dc.description.references | Dask: dask.delayed documentation. https://docs.dask.org/en/stable/delayed.html. Accessed 25 Nov 2022 (2022) | es_ES |
dc.description.references | Spark: web UI. Accessed 25 Nov 2022. https://spark.apache.org/docs/latest/web-ui.html (2022) | es_ES |
dc.description.references | Dask: dashboard diagnostics. Accessed 25 Nov 2022. https://docs.dask.org/en/stable/dashboard.html (2022) | es_ES |
dc.description.references | Wunsch, S.: Analysis of the di-muon spectrum using data from the CMS detector taken in 2012. https://doi.org/10.7483/OPENDATA.CMS.AAR1.4NZQ (2019) | es_ES |
dc.description.references | Padulano, V.E.: Test suite repository. Accessed 25 Nov 2022. https://github.com/vepadulano/distRDF_benchmarks (2022) | es_ES |
dc.description.references | Spark: tuning guide. Accessed 25 Nov 2022. https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism (2022) | es_ES |
dc.description.references | Gupta, A.: Building partitions for processing data files in apache spark. Accessed 25 Nov 2022. https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7 (2020) | es_ES |
dc.description.references | Bertolucci, M., Carlini, E., Dazzi, P., Lulli, A., Ricci, L.: Static and dynamic big data partitioning on apache spark, vol. 27, pp. 489–498. IOS Press. https://doi.org/10.3233/978-1-61499-621-7-489 (2016) | es_ES |