Accelerating distributed deep neural network training with pipelined MPI allreduce

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Castelló, Adrián es_ES
dc.contributor.author Quintana-Ortí, Enrique S. es_ES
dc.contributor.author Duato Marín, José Francisco es_ES
dc.date.accessioned 2022-09-16T18:04:22Z
dc.date.available 2022-09-16T18:04:22Z
dc.date.issued 2021-12 es_ES
dc.identifier.issn 1386-7857 es_ES
dc.identifier.uri http://hdl.handle.net/10251/186225
dc.description.abstract [EN] TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter. es_ES
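
The two MPI-level ideas summarized in the abstract can be illustrated with standard MPI calls. The C sketch below is not the authors' implementation: it only shows (a) how a blocking MPI_Allreduce can be replaced by MPI_Iallreduce followed by MPI_Wait, preserving blocking semantics while using the non-blocking primitive, and (b) a segmented (pipelined) variant that keeps several partial reductions in flight. The segment count NSEG, the buffer size, and the helper names are illustrative assumptions, not values taken from the paper.

/* Minimal sketch (assumed names and sizes), using only standard MPI-3 calls. */
#include <mpi.h>
#include <stdlib.h>

#define NSEG 8  /* illustrative number of pipeline segments */

/* (a) Blocking semantics obtained from the non-blocking primitive. */
static void allreduce_nonblocking(float *buf, int count, MPI_Comm comm) {
    MPI_Request req;
    MPI_Iallreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* synchronize: behaves like MPI_Allreduce */
}

/* (b) Pipelined allreduce: split the buffer into NSEG segments and
 *     keep all partial non-blocking reductions in flight. */
static void allreduce_pipelined(float *buf, int count, MPI_Comm comm) {
    MPI_Request reqs[NSEG];
    int base = count / NSEG, rem = count % NSEG, offset = 0;
    for (int s = 0; s < NSEG; ++s) {
        int len = base + (s < rem ? 1 : 0);   /* distribute the remainder */
        MPI_Iallreduce(MPI_IN_PLACE, buf + offset, len, MPI_FLOAT, MPI_SUM,
                       comm, &reqs[s]);
        offset += len;
    }
    MPI_Waitall(NSEG, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    const int n = 1 << 20;                 /* 1M gradient elements, illustrative */
    float *grad = calloc(n, sizeof(float));
    allreduce_nonblocking(grad, n, MPI_COMM_WORLD);
    allreduce_pipelined(grad, n, MPI_COMM_WORLD);
    free(grad);
    MPI_Finalize();
    return 0;
}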
dc.description.sponsorship Project TIN2017-82972-R of the Spanish Ministerio de Ciencia, Innovación y Universidades; Agencia Valenciana de la Innovación; Juan de la Cierva-Formación project FJC2019-039222-I of the Ministerio de Ciencia, Innovación y Universidades; PRACE Preparatory Access project #2010PA5531. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Cluster Computing es_ES
dc.rights All rights reserved es_ES
dc.subject Message Passing Interface (MPI) es_ES
dc.subject Collective communication primitives es_ES
dc.subject Allreduce es_ES
dc.subject Deep learning es_ES
dc.subject Distributed training es_ES
dc.subject.classification COMPUTER ARCHITECTURE AND TECHNOLOGY es_ES
dc.title Accelerating distributed deep neural network training with pipelined MPI allreduce es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10586-021-03370-9 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MCIU//TIN2017-82972-R/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/PRACE//2010PA5531/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AGENCIA ESTATAL DE INVESTIGACION//FJC2019-039222-I//AYUDA JUAN DE LA CIERVA FORMACION-CASTELLO GIMENO, ADRIAN/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.description.bibliographicCitation Castelló, A.; Quintana-Ortí, ES.; Duato Marín, JF. (2021). Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing. 24(4):3797-3813. https://doi.org/10.1007/s10586-021-03370-9 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10586-021-03370-9 es_ES
dc.description.upvformatpinicio 3797 es_ES
dc.description.upvformatpfin 3813 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 24 es_ES
dc.description.issue 4 es_ES
dc.relation.pasarela S\448457 es_ES
dc.contributor.funder AGENCIA ESTATAL DE INVESTIGACION es_ES
dc.contributor.funder Ministerio de Ciencia, Innovación y Universidades es_ES
dc.contributor.funder Partnership for Advanced Computing in Europe AISBL es_ES
dc.description.references Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016) es_ES
dc.description.references Alsmadi, I., Khreishah, A., Dianxiang, X.: Network slicing to improve multicasting in HPC clusters. Clust. Comput. 21(3), 1493–1506 (2018) es_ES
dc.description.references Awan, A.A., Bedorf, J., Chu, C.-H., Subramoni, H., Panda, D.K.: Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: characterization, designs, and performance evaluation (2018). arXiv:1810.11112 es_ES
dc.description.references Awan, A.A., Chu, C.-H., Subramoni, H., Panda, D.K.: Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–9 (2018) es_ES
dc.description.references Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52(4), 65:1–65:43 (2019) es_ES
dc.description.references Castelló, A., Catalán, M., Dolz, M.F., Mestre, J.I., Quintana-Ortí, E.S., Duato, J.: Evaluation of MPI Allreduce for distributed training of convolutional neural networks. In: 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (2021) es_ES
dc.description.references Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Analysis of model parallelism for distributed neural networks. In: Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI ’19, New York, NY, USA (2019). Association for Computing Machinery es_ES
dc.description.references Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Theoretical scalability analysis of distributed deep convolutional neural networks. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 534–541 (2019) es_ES
dc.description.references Chan, E., Heimlich, M., Purkayastha, A., van de Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput. 19(13), 1749–1783 (2007) es_ES
dc.description.references Clarke, L., Glendinning, I., Hempel, R.: The MPI message passing interface standard. In: Programming Environments for Massively Parallel Distributed Systems, pp. 213–218. Springer, Berlin (1994) es_ES
dc.description.references Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) es_ES
dc.description.references Hasanov, K., Lastovetsky, A.: Hierarchical redesign of classic MPI reduction algorithms. J. Supercomput. 73(2), 713–725 (2017) es_ES
dc.description.references He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) es_ES
dc.description.references Google Inc.: TensorFlow benchmarks. https://github.com/tensorflow/benchmarks es_ES
dc.description.references Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., Hoefler, T.: Data movement is all you need: a case study on optimizing transformers (2020). arXiv:2007.00072 es_ES
dc.description.references Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Full-duplex inter-group all-to-all broadcast algorithms with optimal bandwidth. In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–10 (2018) es_ES
dc.description.references Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W-k: Scalable algorithms for MPI intergroup Allgather and Allgatherv. Parallel Comput. 85, 220–230 (2019) es_ES
dc.description.references Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Clust. Comput. 23(3), 2287–2300 (2020) es_ES
dc.description.references Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Sciences, University of Toronto (2009) es_ES
dc.description.references Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105, Curran Associates Inc. (2012) es_ES
dc.description.references Kurnosov, M., Tokmasheva, E.: Shared memory based MPI broadcast algorithms for NUMA systems. In: Russian Supercomputing Days, pp. 473–485. Springer, Berlin (2020) es_ES
dc.description.references Li, S., Hoefler, T., Chungjin, H., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014) es_ES
dc.description.references Nguyen, T.T., Wahib, M., Takano, R.: Hierarchical distributed-memory multi-leader MPI_Allreduce for deep learning workloads. In: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 216–222. IEEE (2018) es_ES
dc.description.references Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019) es_ES
dc.description.references Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow (2018). arXiv:1802.05799 es_ES
dc.description.references Shalf, J.: HPC interconnects at the end of Moore’s Law. In: 2019 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3 (2019) es_ES
dc.description.references Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556 es_ES
dc.description.references Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: The Complete Reference. The MIT Press, New York (1996) es_ES
dc.description.references Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017) es_ES
dc.description.references Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 19(1), 49–66 (2005) es_ES
dc.description.references Worringen, J.: Pipelining and overlapping for MPI collective operations. In: 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 548–557. IEEE (2003) es_ES
dc.description.references Zhao, Y., Wang, L., Wu, W., Bosilca, G., Vuduc, R., Ye, J., Tang, W., Xu, Z.: Efficient communications in training large scale neural networks. In: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 110–116 (2017) es_ES
dc.description.references Zhong, D., Cao, Q., Bosilca, G., Dongarra, J.: Using advanced vector extensions AVX-512 for MPI reductions. In: 27th European MPI Users' Group Meeting, pp. 1–10 (2020) es_ES

