dc.contributor.author | Castelló, Adrián | es_ES |
dc.contributor.author | Quintana-Ortí, Enrique S. | es_ES |
dc.contributor.author | Duato Marín, José Francisco | es_ES |
dc.date.accessioned | 2022-09-16T18:04:22Z | |
dc.date.available | 2022-09-16T18:04:22Z | |
dc.date.issued | 2021-12 | es_ES |
dc.identifier.issn | 1386-7857 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10251/186225 | |
dc.description.abstract | [EN] TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD, in turn, utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter. | es_ES |
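The segmented ("pipelined") Allreduce the abstract refers to can be illustrated with a small simulation. The sketch below is hypothetical and not the authors' code: it models a ring Allreduce in pure Python, splitting each process's buffer into chunks so that, in a real MPI implementation (e.g. via `MPI_Iallreduce` or segmented point-to-point messages), the hops of different chunks could proceed concurrently and overlap with computation. Here the simulated "processes" are plain lists and the hops run sequentially; only the segmentation idea is shown.

```python
def ring_allreduce(buffers, segments=4):
    """Sum-allreduce `buffers` (one list per simulated process) by splitting
    each buffer into `segments` chunks that travel around a logical ring.
    Returns the fully reduced buffer every process would end up holding."""
    p = len(buffers)          # number of simulated processes (ranks)
    n = len(buffers[0])       # elements per buffer
    step = max(1, n // segments)
    # Disjoint index ranges, one per pipeline segment.
    chunks = [range(i, min(i + step, n)) for i in range(0, n, step)]

    data = [list(b) for b in buffers]  # working copy per rank

    # Reduce-scatter phase: chunk c starts at rank c % p and accumulates
    # contributions as it moves around the ring for p-1 hops.
    for c, idx in enumerate(chunks):
        for hop in range(p - 1):
            src = (c + hop) % p
            dst = (src + 1) % p
            for i in idx:
                data[dst][i] += data[src][i]
        # chunk c is now fully reduced at rank (c + p - 1) % p

    # Allgather phase: the owner of each reduced chunk shares it with all ranks.
    for c, idx in enumerate(chunks):
        owner = (c + p - 1) % p
        for r in range(p):
            for i in idx:
                data[r][i] = data[owner][i]
    return data[0]


# Example: three ranks, four elements, four pipeline segments.
buffers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
print(ring_allreduce(buffers, segments=4))  # → [111, 222, 333, 444]
```

The pipelining benefit comes from the segmentation: with several in-flight chunks, the network link between each pair of ranks stays busy instead of idling while a single monolithic message makes its p-1 hops.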
dc.description.sponsorship | Project TIN2017-82972-R of the Spanish Ministerio de Ciencia, Innovación y Universidades. Agencia Valenciana de la Innovación. Juan de la Cierva-Formación project FJC2019-039222-I of the Ministerio de Ciencia, Innovación y Universidades. PRACE Preparatory Access project #2010PA5531. | es_ES |
dc.language | English | es_ES |
dc.publisher | Springer-Verlag | es_ES |
dc.relation.ispartof | Cluster Computing | es_ES |
dc.rights | All rights reserved | es_ES |
dc.subject | Message Passing Interface (MPI) | es_ES |
dc.subject | Collective communication primitives | es_ES |
dc.subject | Allreduce | es_ES |
dc.subject | Deep learning | es_ES |
dc.subject | Distributed training | es_ES |
dc.subject.classification | ARQUITECTURA Y TECNOLOGIA DE COMPUTADORES | es_ES |
dc.title | Accelerating distributed deep neural network training with pipelined MPI allreduce | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.1007/s10586-021-03370-9 | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MCIU//TIN2017-82972-R/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/PRACE//2010PA5531/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AGENCIA ESTATAL DE INVESTIGACION//FJC2019-039222-I//AYUDA JUAN DE LA CIERVA FORMACION-CASTELLO GIMENO, ADRIAN/ | es_ES |
dc.rights.accessRights | Open access | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors | es_ES |
dc.description.bibliographicCitation | Castelló, A.; Quintana-Ortí, ES.; Duato Marín, JF. (2021). Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing. 24(4):3797-3813. https://doi.org/10.1007/s10586-021-03370-9 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | https://doi.org/10.1007/s10586-021-03370-9 | es_ES |
dc.description.upvformatpinicio | 3797 | es_ES |
dc.description.upvformatpfin | 3813 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 24 | es_ES |
dc.description.issue | 4 | es_ES |
dc.relation.pasarela | S\448457 | es_ES |
dc.contributor.funder | AGENCIA ESTATAL DE INVESTIGACION | es_ES |
dc.contributor.funder | Ministerio de Ciencia, Innovación y Universidades | es_ES |
dc.contributor.funder | Partnership for Advanced Computing in Europe AISBL | es_ES |
dc.description.references | Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016) | es_ES |
dc.description.references | Alsmadi, I., Khreishah, A., Dianxiang, X.: Network slicing to improve multicasting in HPC clusters. Clust. Comput. 21(3), 1493–1506 (2018) | es_ES |
dc.description.references | Awan, A.A., Bedorf, J., Chu, C.-H., Subramoni, H., Panda, D.K.: Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: characterization, designs, and performance evaluation (2018). arXiv:1810.11112 | es_ES |
dc.description.references | Awan, A.A., Chu, C.-H., Subramoni, H., Panda, D.K.: Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–9 (2018) | es_ES |
dc.description.references | Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52(4), 65:1–65:43 (2019) | es_ES |
dc.description.references | Castelló, A., Catalán, M., Dolz, M.F., Mestre, J.I., Quintana-Ortí, E.S., Duato, J.: Evaluation of MPI Allreduce for distributed training of convolutional neural networks. In: 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (2021) | es_ES |
dc.description.references | Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Analysis of model parallelism for distributed neural networks. In: Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI ’19, New York, NY, USA (2019). Association for Computing Machinery | es_ES |
dc.description.references | Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Theoretical scalability analysis of distributed deep convolutional neural networks. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 534–541 (2019) | es_ES |
dc.description.references | Chan, E., Heimlich, M., Purkayastha, A., van de Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput. 19(13), 1749–1783 (2007) | es_ES |
dc.description.references | Clarke, L., Glendinning, I., Hempel, R.: The MPI message passing interface standard. In: Programming Environments for Massively Parallel Distributed Systems, pp. 213–218. Springer, Berlin (1994) | es_ES |
dc.description.references | Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) | es_ES |
dc.description.references | Hasanov, K., Lastovetsky, A.: Hierarchical redesign of classic MPI reduction algorithms. J. Supercomput. 73(2), 713–725 (2017) | es_ES |
dc.description.references | He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) | es_ES |
dc.description.references | Google Inc.: TensorFlow benchmarks. https://github.com/tensorflow/benchmarks | es_ES |
dc.description.references | Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., Hoefler, T.: Data movement is all you need: a case study on optimizing transformers (2020). arXiv:2007.00072 | es_ES |
dc.description.references | Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Full-duplex inter-group all-to-all broadcast algorithms with optimal bandwidth. In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–10 (2018) | es_ES |
dc.description.references | Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Scalable algorithms for MPI intergroup Allgather and Allgatherv. Parallel Comput. 85, 220–230 (2019) | es_ES |
dc.description.references | Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Clust. Comput. 23(3), 2287–2300 (2020) | es_ES |
dc.description.references | Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Sciences, University of Toronto (2009) | es_ES |
dc.description.references | Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105, Curran Associates Inc. (2012) | es_ES |
dc.description.references | Kurnosov, M., Tokmasheva, E.: Shared memory based MPI broadcast algorithms for NUMA systems. In: Russian Supercomputing Days, pp. 473–485. Springer, Berlin (2020) | es_ES |
dc.description.references | Li, S., Hoefler, T., Chungjin, H., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014) | es_ES |
dc.description.references | Nguyen, T.T., Wahib, M., Takano, R.: Hierarchical distributed-memory multi-leader MPI_Allreduce for deep learning workloads. In: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 216–222. IEEE (2018) | es_ES |
dc.description.references | Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019) | es_ES |
dc.description.references | Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow (2018). arXiv:1802.05799 | es_ES |
dc.description.references | Shalf, J.: HPC interconnects at the end of Moore’s Law. In: 2019 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3 (2019) | es_ES |
dc.description.references | Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556 | es_ES |
dc.description.references | Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: The Complete Reference. The MIT Press, New York (1996) | es_ES |
dc.description.references | Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017) | es_ES |
dc.description.references | Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 19(1), 49–66 (2005) | es_ES |
dc.description.references | Worringen, J.: Pipelining and overlapping for MPI collective operations. In: 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 548–557. IEEE (2003) | es_ES |
dc.description.references | Zhao, Y., Wang, L., Wu, W., Bosilca, G., Vuduc, R., Ye, J., Tang, W., Xu, Z.: Efficient communications in training large scale neural networks. In: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 110–116 (2017) | es_ES |
dc.description.references | Zhong, D., Cao, Q., Bosilca, G., Dongarra, J.: Using Advanced Vector Extensions AVX-512 for MPI reductions. In: 27th European MPI Users' Group Meeting, pp. 1–10 (2020) | es_ES |