Apollinari, G., Béjar Alonso, I., Brüning, O., Fessia, P., Lamont, M., Rossi, L., Tavian, L.: High-luminosity large hadron collider (HL-LHC): technical design report V. 0.1. Technical report CERN. https://doi.org/10.23731/CYRM-2017-004 (2017)
Elsen, E.: A roadmap for HEP software and computing R&D for the 2020s. Comput Softw Big Sci, vol 16(3). https://doi.org/10.1007/s41781-019-0031-6 (2019)
Brun, R., Rademakers, F.: ROOT — an object oriented data analysis framework. Nuclear Instr. Methods Phys. Res. Section A Accelerators, Spectrometers, Detectors Assoc. Equip. 389(1), 81–86 (1997). https://doi.org/10.1016/S0168-9002(97)00048-X. New computing techniques in physics research V
Blomer, J., Canal, P., Naumann, A., Piparo, D.: Evolution of the ROOT tree I/O. EPJ Web Conf. 245, 02030 (2020). https://doi.org/10.1051/epjconf/202024502030
Lopez-Gomez, J., Blomer, J.: RNTuple performance: status and outlook. arXiv:2022.09043. https://doi.org/10.48550
Piparo, D., Canal, P., Guiraud, E., Valls Pla, X., Ganis, G., Amadio, G., Naumann, A., Tejedor Saavedra, E.: RDataFrame: easy parallel ROOT analysis at 100 threads. EPJ Web Conf. 214, 06029 (2019). https://doi.org/10.1051/epjconf/201921406029
Bird, I.: Computing for the large hadron collider. Annu. Rev. Nucl. Part. Sci. 61(1), 99–118 (2011). https://doi.org/10.1146/annurev-nucl-102010-130059
ROOT Team, Brann, K.A., Amadio, G., An, S., Bellenot, B., Blomer, J., Canal, P., Couet, O., Galli, M., Guiraud, E., Hageboeck, S., Linev, S., Vila, P.M., Moneta, L., Naumann, A., Tadel, A.M., Padulano, V.E., Rademakers, F., Shadura, O., Tadel, M., Saavedra, E.T., Pla, X.V., Vassilev, V., Wunsch, S.: Software challenges for HL-LHC data analysis. arXiv:2004.07675. 10.48550 (2020)
Tannenbaum, T., Wright, D., Miller, K., Livny, M.: Condor – a Distributed Job Scheduler. In: Sterling, T. (ed.) Beowulf Cluster Computing with Linux. MIT Press (2001)
Jette, M., Dunlap, C., Garlick, J., Grondona, M.: Slurm: simple linux utility for resource management. Technical report, LLNL. https://www.osti.gov/biblio/15002962 (2002)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10, p. 10. USENIX association. https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets (2010)
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Huff, K., Bergstra, J. (eds.) Proceedings of the 14th Python in Science Conference, pp. 130–136. SciPy (2015)
Rilee, M., Griessbaum, N., Kuo, K.-S., Frew, J., Wolfe, R.: STARE-based integrative analysis of diverse data using dask parallel programming demo paper. In: Proceedings of the 28th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’20, pp. 417–420. Association for computing machinery. https://doi.org/10.1145/3397536.3422346 (2020)
Gharat, J., Kumar, B., Ragha, L., Barve, A., Jeelani, S.M., Clyne, J.: Development of NCL equivalent serial and parallel python routines for meteorological data analysis. Int. J. High Performance Comput. Appl., https://doi.org/10.1177/10943420221077110 (2022)
Hamman, J.J., Rocklin, M., Abernathy, R.M.: Pangeo: a big-data ecosystem for scalable earth system science. In: 20th EGU General Assembly, EGU2018, p. 12146. The SAO/NASA astrophysics data system (ADS) (2018)
Fan, S., Linke, M., Paraskevakos, I., Gowers, R.J., Gecht, M., Beckstein, O.: PMDA - Parallel molecular dynamics analysis. In: Calloway, C., Lippa, D., Niederhut, D., Shupe, D. (eds.) Proceedings of the 18th Python in Science Conference, pp. 134–142. SciPy. https://doi.org/10.25080/Majora-7ddc1dd1-013 (2019)
Dask: dask.dataframe documentation. https://docs.dask.org/en/stable/dataframe.html. Accessed 25 Nov 2022 (2022)
Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 1, 145–164 (2016). https://doi.org/10.1007/s41060-016-0027-9
Khan, M.A., Karim, M.R., Kim, Y.: A two-stage big data analytics framework with real world applications using spark machine learning and long short-term memory network. Symmetry, vol. 10(10). https://doi.org/10.3390/sym10100485 (2018)
Ramírez-Gallego, S., Mouriño-Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Benítez, J.M., Alonso-Betanzos, A., Herrera, F.: An information theory-based feature selection framework for big data under apache spark. IEEE Trans. Syst. Man Cybern. Syst. 48(9), 1441–1453 (2018). https://doi.org/10.1109/TSMC.2017.2670926
Chaudhari, A.A., Mulay, P.: SCSI: real-time data analysis with cassandra and spark, pp. 237–264. Springer. https://doi.org/10.1007/978-981-13-0550-4_11 (2019)
Shyam, R., Bharathi Ganesh, H.B., Sachin Kumar, S., Poornachandran, P., Soman, K.P.: Apache spark a big data analytics platform for smart grid. Proced. Technol. 21, 171–178 (2015). https://doi.org/10.1016/j.protcy.2015.10.085
Shin, H., Lee, K., Kwon, H.: A comparative experimental study of distributed storage engines for big spatial data processing using GeoSpark. J. Supercomput. 78, 2556–2579 (2022). https://doi.org/10.1007/s11227-021-03946-7
Graur, D., Müller, I., Proffitt, M., Fourny, G., Watts, G.T., Alonso, G.: Evaluating query languages and systems for high-energy physics data. Proc. VLDB Endow. 15(2), 154–168 (2021). https://doi.org/10.14778/3489496.3489498
Feichtinger, D., Canal, P., Reed, C., Loizides, C., Ballintijn, M., Rademakers, F., Peters, A.J., Kickinger, G., Iwaszkiewicz, J., Ganis, G., Brun, R., Bellenot, B.: PROOF - the parallel ROOT facility. In: 2006 15th IEEE International Conference on High Performance Distributed Computing, pp. 379–380. EDP sciences. https://doi.org/10.1109/HPDC.2006.1652193 (2006)
Chatrchyan, S., et al.: The CMS experiment at the CERN LHC. JINST 3, S08004 (2008). https://doi.org/10.1088/1748-0221/3/08/S08004
Sehrish, S., Kowalkowski, J., Paterno, M.: Spark and HPC for high energy physics data analyses. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1048–1057. IEEE, Lake Buena Vista, FL, USA. https://doi.org/10.1109/IPDPSW.2017.112 (2017)
Gutsche, O., Cremonesi, M., Elmer, P., Jayatilaka, B., Kowalkowski, J., Pivarski, J., Sehrish, S., Suárez, C.M., Svyatkovskiy, A., Tran, N.: Big data in HEP: a comprehensive use case study. J. Phys. Conf. Ser. 898, 072012 (2017). https://doi.org/10.1088/1742-6596/898/7/072012
Gutsche, O., Canali, L., Cremer, I., Cremonesi, M., Elmer, P., Fisk, I., Girone, M., Jayatilaka, B., Kowalkowski, J., Khristenko, V., Motesnitsalis, E., Pivarski, J., Sehrish, S., Surdy, K., Svyatkovskiy, A.: CMS analysis and data reduction with apache spark. J. Phys. Conf. Ser. 1085, 042030 (2018). https://doi.org/10.1088/1742-6596/1085/4/042030
Avati, V., Blaszkiewicz, M., Bocchi, E., Canali, L., Castro, D., Cervantes, J., Grzanka, L., Guiraud, E., Kaspar, J., Kothuri, P., Lamanna, M., Malawski, M., Mnich, A., Moscicki, J., Murali, S., Piparo, D., Tejedor, E.: Declarative big data analysis for high-energy physics: TOTEM use case. In: Yahyapour, R. (ed.) Euro-par 2019: Parallel Processing, pp. 241–255. Springer (2019)
Baranowski, Z., Kleszcz, E., Kothuri, P., Canali, L., Castellotti, R., Marquez, M.M., De Barros, N.G.M., Motesnitsalis, E., Mrowczynski, P., Duran, J.C.L.: Evolution of the hadoop platform and ecosystem for high energy physics. EPJ Web Conf. 214, 04058 (2019). https://doi.org/10.1051/epjconf/201921404058
Adamec, M., Attebury, G., Bloom, K., Bockelman, B., Lundstedt, C., Shadura, O., Thiltges, J.: Coffea-casa: an analysis facility prototype. EPJ Web Conf. 251, 02061 (2021). https://doi.org/10.1051/epjconf/202125102061
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). https://doi.org/10.1145/1327452.1327492
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC ’13. Association for computing machinery. https://doi.org/10.1145/2523616.2523633 (2013)
Kubernetes: homepage. https://kubernetes.io/. Accessed 25 Nov 2022 (2022)
NumPy: homepage. https://numpy.org/. Accessed 25 Nov 2022 (2022)
Pandas: homepage. https://pandas.pydata.org/. Accessed 25 Nov 2022 (2022)
Nitzberg, B., Schopf, J.M., Jones, J.P.: PBS pro: grid computing and scheduling attributes, pp. 183–190. Kluwer academic publishers, USA (2004)
Hudak, P.: Conception, evolution, and application of functional programming languages. ACM Comput. Surv. 21(3), 359–411 (1989). https://doi.org/10.1145/72551.72554
Dozza, M., Bärgman, J., Lee, J.D.: Chunking: a procedure to improve naturalistic data analysis. Accident Anal. Prevention 58, 309–317 (2013). https://doi.org/10.1016/j.aap.2012.03.020
Rew, R.: Chunking data: why it matters. https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters (2013)
Padulano, V.E., Villanueva, J.C., Guiraud, E., Saavedra, E.T.: Distributed data analysis with ROOT RDataframe. EPJ Web Conf. 245, 03009 (2020). https://doi.org/10.1051/epjconf/202024503009
Dask: dask.delayed documentation. https://docs.dask.org/en/stable/delayed.html. Accessed 25 Nov 2022 (2022)
Spark: web UI. Accessed 25 Nov 2022. https://spark.apache.org/docs/latest/web-ui.html (2022)
Dask: dashboard diagnostics. Accessed 25 Nov 2022. https://docs.dask.org/en/stable/dashboard.html (2022)
Wunsch, S.: Analysis of the di-muon spectrum using data from the CMS detector taken in 2012. https://doi.org/10.7483/OPENDATA.CMS.AAR1.4NZQ (2019)
Padulano, V.E.: Test suite repository. Accessed 25 Nov 2022. https://github.com/vepadulano/distRDF_benchmarks (2022)
Spark: tuning guide. Accessed 25 Nov 2022. https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism (2022)
Gupta, A.: Building partitions for processing data files in apache spark. Accessed 25 Nov 2022. https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7 (2020)
Bertolucci, M., Carlini, E., Dazzi, P., Lulli, A., Ricci, L.: Static and dynamic big data partitioning on apache spark, vol. 27, pp. 489–498. IOS Press. https://doi.org/10.3233/978-1-61499-621-7-489 (2016)